D. Oneata, J. Verbeek, and C. Schmid, The LEAR submission at Thumos 2014, ECCV International Workshop and Competition on Action Recognition with a Large Number of Classes, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01074442

H. Wang, D. Oneata, J. Verbeek, and C. Schmid, A robust and efficient video representation for action recognition. ArXiv e-prints, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01145834

Bibliography

R. Achanta, A. Shaji, K. Smith, A. Lucchi et al., SLIC superpixels compared to state-of-the-art superpixel methods, PAMI, vol.34, issue.11, pp.2274-2282, 2012.

J. Aggarwal and Q. Cai, Human motion analysis: A review, CVIU, vol.73, issue.3, pp.428-440, 1999.

J. Aggarwal and M. Ryoo, Human activity analysis, ACM Computing Surveys, vol.43, issue.3, pp.1-43, 2011.
DOI : 10.1145/1922649.1922653

B. Alexe, T. Deselaers, and V. Ferrari, Measuring the Objectness of Image Windows, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.11, pp.2189-2202, 2012.
DOI : 10.1109/TPAMI.2012.28

J. Almazan, A. Gordo, A. Fornés, and E. Valveny, Handwritten Word Spotting with Corrected Attributes, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.130

URL : https://hal.archives-ouvertes.fr/hal-00906787

R. Aly, R. Arandjelovic, K. Chatfield, M. Douze, B. Fernando et al., The AXES submissions at TRECVID 2013, TRECVID Workshop, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00904404

S. An, P. Peursum, W. Liu, and S. Venkatesh, Efficient algorithms for subwindow search in object detection and localization, CVPR, 2009.

R. Arandjelovic and A. Zisserman, Three things everyone should know to improve object retrieval, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6248018

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.370.7498

R. Arandjelović and A. Zisserman, All about VLAD, CVPR, 2013.

P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, Contour Detection and Hierarchical Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, issue.5, pp.898-916, 2011.
DOI : 10.1109/TPAMI.2010.161

F. R. Bach, Exploring large feature spaces with hierarchical multiple kernel learning, NIPS, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00319660

N. Ballas, Y. Yang, Z. Lan, B. Delezoide, F. Prêteux et al., Space-time robust video representation for action recognition, ICCV, pp.51-53, 2013.

A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson et al., Video in sentences out, Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence, 2012.

H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, SURF: Speeded up robust features, CVIU, vol.110, issue.3, pp.346-359, 2008.

P. Beaudet, Rotationally invariant image operators, ICPR, 1978.

R. Bellman, Dynamic Programming, 1957.

H. Bilen, V. Namboodiri, and L. Van Gool, Object and Action Classification with Latent Window Parameters, International Journal of Computer Vision, vol.15, issue.4, pp.237-251, 2014.
DOI : 10.1007/s11263-013-0646-8

M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, 2005.
DOI : 10.1109/ICCV.2005.28

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.8218

M. Blaschko and C. Lampert, Learning to Localize Objects with Structured Output Regression, ECCV, 2008.
DOI : 10.1007/978-3-540-88682-2_2

A. F. Bobick, Movement, activity and action: the role of knowledge in the perception of motion, Philosophical Transactions of the Royal Society B: Biological Sciences, vol.352, issue.1358, pp.1257-1265, 1997.
DOI : 10.1098/rstb.1997.0108

A. Bosch, A. Zisserman, and X. Munoz, Representing shape with a spatial pyramid kernel, Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR '07, 2007.
DOI : 10.1145/1282280.1282340

Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, Learning mid-level features for recognition, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5539963

Y. Boureau, J. Ponce, and Y. LeCun, A theoretical analysis of feature pooling in visual recognition, ICML, 2010.

M. Brand, Shadow puppetry, Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
DOI : 10.1109/ICCV.1999.790422

W. Brendel and S. Todorovic, Learning spatio-temporal graphs of human activities, ICCV, 2011.

T. Brox and J. Malik, Object Segmentation by Long Term Analysis of Point Trajectories, ECCV, p.12, 2010.
DOI : 10.1007/978-3-642-15555-0_21

T. Brox and J. Malik, Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, issue.3, pp.93-95, 2011.
DOI : 10.1109/TPAMI.2010.143

Z. Cai, L. Wang, X. Peng, and Y. Qiao, Multi-view Super Vector for Action Recognition, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.83

L. Campbell, D. Becker, A. Azarbayejani, A. Bobick, and A. Pentland, Invariant features for 3-D gesture recognition, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 1996.
DOI : 10.1109/AFGR.1996.557258

L. W. Campbell and A. F. Bobick, Recognition of human body motion using phase space constraints, Proceedings of IEEE International Conference on Computer Vision, 1995.
DOI : 10.1109/ICCV.1995.466880

L. Cao, Z. Liu, and T. Huang, Cross-dataset action detection, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5539875

L. Cao, Y. Mu, A. Natsev, S. Chang, G. Hua et al., Scene Aligned Pooling for Complex Video Recognition, ECCV, 2012.
DOI : 10.1007/978-3-642-33709-3_49

J. Carreira and C. Sminchisescu, CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.7, pp.1312-1328, 2012.
DOI : 10.1109/TPAMI.2011.231

J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, Semantic Segmentation with Second-Order Pooling, ECCV, 2012.
DOI : 10.1007/978-3-642-33786-4_32

C. Chang and C. Lin, LIBSVM, ACM Transactions on Intelligent Systems and Technology, vol.2, issue.3, pp.1-27, 2011.
DOI : 10.1145/1961189.1961199

K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, Proceedings of the British Machine Vision Conference 2011, pp.15-33, 2011.
DOI : 10.5244/C.25.76

A. Chen and J. Corso, Propagating multi-class pixel labels throughout video frames, 2010 Western New York Image Processing Workshop, 2010.
DOI : 10.1109/WNYIPW.2010.5649773

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.188.7421

Q. Chen, Z. Song, R. Feris, A. Datta, L. Cao et al., Efficient Maximum Appearance Search for Large-Scale Object Detection, 2013 IEEE Conference on Computer Vision and Pattern Recognition, p.62, 2013.
DOI : 10.1109/CVPR.2013.410

G. Cheng, Y. Wan, A. N. Saudagar, K. Namuduri, and B. P. Buckles, Advances in human action recognition: A survey. ArXiv e-prints, 2015.

M. Cheng, Z. Zhang, W. Lin, and P. Torr, BING: Binarized Normed Gradients for Objectness Estimation at 300fps, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.3286-3293, 2014.
DOI : 10.1109/CVPR.2014.414

K. Church and W. Gale, Poisson mixtures, Natural Language Engineering, vol.1, issue.2, pp.163-190, 1995.
DOI : 10.1002/asi.4630260402

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, Describing Textures in the Wild, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.461

URL : https://hal.archives-ouvertes.fr/hal-01109284

R. Cinbis, J. Verbeek, and C. Schmid, Image categorization using Fisher kernels of non-iid image models, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6247926

URL : https://hal.archives-ouvertes.fr/hal-00685943

R. Cinbis, J. Verbeek, and C. Schmid, Segmentation Driven Object Detection with Fisher Vectors, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.369

URL : https://hal.archives-ouvertes.fr/hal-00873134

S. Clinchant, J. Renders, and G. Csurka, Trans-Media Pseudo-Relevance Feedback Methods in Multimedia Retrieval, Advances in Multilingual and Multimodal Information Retrieval, 2008.
DOI : 10.1007/978-3-540-85760-0_71

J. Corso, E. Sharon, S. Dube, S. El-Saden, U. Sinha et al., Efficient Multilevel Brain Tumor Segmentation With Integrated Bayesian Model Classification, IEEE Transactions on Medical Imaging, vol.27, issue.5, pp.629-640, 2008.
DOI : 10.1109/TMI.2007.912817

C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, pp.273-297, 1995.
DOI : 10.1007/BF00994018

G. Csurka and F. Perronnin, An Efficient Approach to Semantic Segmentation, International Journal of Computer Vision, vol.60, issue.2, pp.198-212, 2011.
DOI : 10.1007/s11263-010-0344-8

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, ECCV Workshop on Statistical Learning in Computer Vision, pp.13-32, 2004.

O. Cula and K. Dana, Compact representation of bidirectional texture functions, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 2001.
DOI : 10.1109/CVPR.2001.990645

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.177

URL : https://hal.archives-ouvertes.fr/inria-00548512

T. Darrell and A. Pentland, Space-time gestures, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1993.
DOI : 10.1109/CVPR.1993.341109

P. Das, C. Xu, R. Doell, and J. Corso, A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.340

P. Dollár and C. Zitnick, Structured Forests for Fast Edge Detection, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.231

P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-Temporal Features, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp.11-28, 2005.
DOI : 10.1109/VSPETS.2005.1570899

O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, 2009 IEEE 12th International Conference on Computer Vision, pp.22-55, 2009.
DOI : 10.1109/ICCV.2009.5459279

I. Endres and D. Hoiem, Category Independent Object Proposals, ECCV, 2010.
DOI : 10.1007/978-3-642-15555-0_42

C. Fanti, L. Zelnik-Manor, and P. Perona, Hybrid Models for Human Motion Recognition, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.179

A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, Describing objects by their attributes, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206772

G. Farnebäck, Two-Frame Motion Estimation Based on Polynomial Expansion, Proceedings of the Scandinavian Conference on Image Analysis, p.12, 2003.
DOI : 10.1007/3-540-45103-X_50

P. Felzenszwalb and D. Huttenlocher, Efficient Graph-Based Image Segmentation, International Journal of Computer Vision, vol.59, issue.2, pp.167-181, 2004.
DOI : 10.1023/B:VISI.0000022288.19776.77

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, pp.1627-1645, 2010.
DOI : 10.1109/TPAMI.2009.167

V. Ferrari, M. Marin-Jimenez, and A. Zisserman, Progressive search space reduction for human pose estimation, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587468

M. A. Fischler and R. Elschlager, The Representation and Matching of Pictorial Structures, IEEE Transactions on Computers, vol.22, issue.1, pp.67-92, 1973.
DOI : 10.1109/T-C.1973.223602

C. Fowlkes, S. Belongie, F. Chung, and J. Malik, Spectral grouping using the Nyström method, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.26, issue.2, pp.214-225, 2004.
DOI : 10.1109/TPAMI.2004.1262185

K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik, Learning to segment moving objects in videos, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.91-92, 2015.
DOI : 10.1109/CVPR.2015.7299035

A. Gaidon, Z. Harchaoui, and C. Schmid, Actom sequence models for efficient action detection, CVPR 2011, pp.22-56, 2011.
DOI : 10.1109/CVPR.2011.5995646

URL : https://hal.archives-ouvertes.fr/inria-00575217

A. Gaidon, Z. Harchaoui, and C. Schmid, Activity representation with motion hierarchies, International Journal of Computer Vision, vol.10, issue.3, pp.219-238, 2013.
DOI : 10.1007/s11263-013-0677-1

URL : https://hal.archives-ouvertes.fr/hal-00908581

F. Galasso, N. Nagaraja, T. Cardenas, T. Brox, and B. Schiele, A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.438

A. Gandhi, K. Alahari, and C. Jawahar, Decomposing Bag of Words Histograms, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.45

URL : https://hal.archives-ouvertes.fr/hal-00874895

D. M. Gavrila, The Visual Analysis of Human Movement: A Survey, Computer Vision and Image Understanding, vol.73, issue.1, pp.82-98, 1999.
DOI : 10.1006/cviu.1998.0716

E. Gavves, B. Fernando, C. Snoek, A. Smeulders, and T. Tuytelaars, Fine-grained categorization by alignments, ICCV, 2013.
DOI : 10.1109/iccv.2013.215

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.643.7151

P. Gehler and S. Nowozin, On feature combination for multiclass object classification, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459169

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.88-91, 2014.
DOI : 10.1109/CVPR.2014.81

R. Girshick, F. Iandola, T. Darrell, and J. Malik, Deformable part models are convolutional neural networks, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298641

M. Gönen and E. Alpaydın, Multiple kernel learning algorithms, JMLR, vol.12, pp.2211-2268, 2011.

L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, Actions as Space-Time Shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, issue.12, pp.2247-2253, 2007.
DOI : 10.1109/TPAMI.2007.70711

M. Grundmann, V. Kwatra, M. Han, and I. Essa, Efficient hierarchical graph-based video segmentation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p.89, 2010.
DOI : 10.1109/CVPR.2010.5539893

A. Gupta, A. Kembhavi, and L. Davis, Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.31, issue.10, pp.1775-1789, 2009.
DOI : 10.1109/TPAMI.2009.83

A. Habibian, K. E. van de Sande, and C. G. Snoek, Recommendations for video event recognition using concept vocabularies, Proceedings of the 3rd ACM conference on International conference on multimedia retrieval, ICMR '13, 2013.
DOI : 10.1145/2461466.2461482

C. Harris and M. Stephens, A Combined Corner and Edge Detector, Proceedings of the Alvey Vision Conference 1988, 1988.
DOI : 10.5244/C.2.23

H. Harzallah, F. Jurie, and C. Schmid, Combining efficient object localization and image classification, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459257

URL : https://hal.archives-ouvertes.fr/inria-00439516

D. Hogg, Model-based vision: a program to see a walking person, Image and Vision Computing, vol.1, issue.1, pp.5-20, 1983.
DOI : 10.1016/0262-8856(83)90003-3

S. J. Hwang, K. Grauman, and F. Sha, Semantic kernel forests from multiple taxonomies, NIPS, pp.1718-1726, 2012.

S. Intille and A. Bobick, Representation and visual recognition of complex, multi-agent actions using belief networks, 1998.

H. Izadinia and M. Shah, Recognizing Complex Events Using Large Margin Joint Low-Level Event Model, ECCV, 2012.
DOI : 10.1007/978-3-642-33765-9_31

T. Jaakkola and D. Haussler, Exploiting generative models in discriminative classifiers, NIPS, pp.15-34, 1999.

M. Jain, H. Jégou, and P. Bouthemy, Better Exploiting Motion for Better Action Recognition, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.330

URL : https://hal.archives-ouvertes.fr/hal-00813014

M. Jain, J. van Gemert, P. Bouthemy, H. Jégou, and C. Snoek, Action Localization with Tubelets from Motion, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.22-91, 2014.
DOI : 10.1109/CVPR.2014.100

URL : https://hal.archives-ouvertes.fr/hal-00996844

H. Jégou, M. Douze, and C. Schmid, On the burstiness of visual elements, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206609

H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez et al., Aggregating Local Image Descriptors into Compact Codes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.9, pp.1704-1716, 2012.
DOI : 10.1109/TPAMI.2011.235

S. Ji, W. Xu, M. Yang, and K. Yu, 3D Convolutional Neural Networks for Human Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.1, pp.221-231, 2013.
DOI : 10.1109/TPAMI.2012.59

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.169.4046

Y. Jia, Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, 2013.
DOI : 10.1145/2647868.2654889

Y. Jiang, Q. Dai, X. Xue, W. Liu, and C. Ngo, Trajectory-Based Modeling of Human Actions with Motion Reference Points, ECCV, p.50, 2012.
DOI : 10.1007/978-3-642-33715-4_31

Y. Jiang, S. Bhattacharya, S. Chang, and M. Shah, High-level event recognition in unconstrained videos, International Journal of Multimedia Information Retrieval, vol.2, issue.2, pp.73-101, 2013.
DOI : 10.1007/s13735-012-0024-2

Y. Jiang, J. Liu, A. Zamir, I. Laptev, M. Piccardi et al., THUMOS challenge: Action recognition with a large number of classes, pp.43-53

Y. Jiang, J. Liu, A. Zamir, G. Toderici, I. Laptev et al., THUMOS challenge: Action recognition with a large number of classes

G. Johansson, Visual perception of biological motion and a model for its analysis, Perception & Psychophysics, vol.4, issue.2, pp.201-211, 1973.
DOI : 10.3758/BF03212378

B. Julesz, Textons, the elements of texture perception, and their interactions, Nature, vol.290, issue.5802, pp.91-97, 1981.
DOI : 10.1038/290091a0

S. Karaman, L. Seidenari, A. D. Bagdanov, and A. Del Bimbo, L1-regularized logistic regression stacking and transductive CRF smoothing for action recognition in video, ICCV Workshop on Action Recognition with a Large Number of Classes, pp.52-53, 2013.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.223

S. M. Katz, Distribution of content words and phrases in text and language modelling, Natural Language Engineering, vol.2, issue.1, pp.15-59, 1996.
DOI : 10.1017/S1351324996001246

I. Kim, S. Oh, A. Vahdat, K. Cannons, A. Perera et al., Segmental multi-way local pooling for video recognition, Proceedings of the 21st ACM international conference on Multimedia, MM '13, 2013.
DOI : 10.1145/2502081.2502167

A. Kläser, M. Marszałek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, p.11, 2008.
DOI : 10.5244/C.22.99

A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman, Human Focused Action Localization in Video, ECCV Workshop on Sign, Gesture, and Activity, pp.22-44, 2010.
DOI : 10.1007/978-3-642-35749-7_17

B. Klein, G. Lev, G. Sadeh, and L. Wolf, Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. ArXiv e-prints, 2014.

A. Kojima, T. Tamura, and K. Fukunaga, Natural language description of human activities from video images based on concept hierarchy of actions, International Journal of Computer Vision, vol.50, issue.2, pp.171-184, 2002.
DOI : 10.1023/A:1020346032608

P. Krähenbühl and V. Koltun, Geodesic object proposals, ECCV, 2014.

J. Krapac, J. Verbeek, and F. Jurie, Modeling spatial layout with fisher vectors for image categorization, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126406

URL : https://hal.archives-ouvertes.fr/inria-00612277

A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, NIPS, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, pp.42-51, 2011.
DOI : 10.1109/ICCV.2011.6126543

L. Lamel and J. Gauvain, Speech Processing for Audio Indexing, Advances in Natural Language Processing, 2008.
DOI : 10.1109/TSA.1996.481450

C. Lampert, M. Blaschko, and T. Hofmann, Efficient Subwindow Search: A Branch and Bound Framework for Object Localization, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.31, issue.12, pp.2129-2142, 2009.
DOI : 10.1109/TPAMI.2009.144

C. Lampert, H. Nickisch, and S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206594

T. Lan, Y. Wang, and G. Mori, Discriminative figure-centric models for joint action localization and recognition, ICCV, p.22, 2011.

I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision, vol.64, issue.2-3, pp.107-123, 2005.
DOI : 10.1007/s11263-005-1838-7

I. Laptev and P. Pérez, Retrieving actions in movies, 2007 IEEE 11th International Conference on Computer Vision, pp.27-56, 2007.
DOI : 10.1109/ICCV.2007.4409105

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.29-35, 2008.
DOI : 10.1109/CVPR.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2 (CVPR'06), 2006.
DOI : 10.1109/CVPR.2006.68

URL : https://hal.archives-ouvertes.fr/inria-00548585

Q. Le, W. Zou, S. Yeung, and A. Ng, Learning hierarchical invariant spatiotemporal features for action recognition with independent subspace analysis, CVPR, 2011.

T. Leung and J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, International Journal of Computer Vision, vol.43, issue.1, pp.29-44, 2001.
DOI : 10.1023/A:1011126920638

W. Li, Q. Yu, A. Divakaran, and N. Vasconcelos, Dynamic Pooling for Complex Event Recognition, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.339

Z. Li, E. Gavves, K. van de Sande, C. Snoek, and A. Smeulders, Codemaps - Segment, Classify and Search Objects Locally, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.454

J. Liu, J. Luo, and M. Shah, Recognizing realistic actions from videos, CVPR, pp.11-52, 2009.

J. Liu, B. Kuipers, and S. Savarese, Recognizing human actions by attributes, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995353

L. Liu, C. Shen, L. Wang, A. van den Hengel, and C. Wang, Encoding high dimensional local features by sparse coding based Fisher vectors, NIPS, 2014.

D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol.60, issue.2, pp.91-110, 2004.
DOI : 10.1023/B:VISI.0000029664.99615.94

J. Lu, H. Yang, D. Min, and M. Do, Patch Match Filter: Efficient Edge-Aware Filtering Meets Randomized Search for Fast Correspondence Field Estimation, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.242

B. D. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, IJCAI, pp.674-679, 1981.

S. Ma, J. Zhang, N. Ikizler-Cinbis, and S. Sclaroff, Action Recognition and Localization by Hierarchical Space-Time Segments, 2013 IEEE International Conference on Computer Vision, pp.51-52, 2013.
DOI : 10.1109/ICCV.2013.341

T. Ma and L. Latecki, Maximum weight cliques with mutex constraints for video object segmentation, CVPR, 2012.

Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe et al., Complex Event Detection via Multi-source Video Attributes, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.339

J. Malik, S. Belongie, J. Shi, and T. Leung, Textons, contours and regions: cue integration in image segmentation, Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
DOI : 10.1109/ICCV.1999.790346

S. Manen, M. Guillaumin, and L. Van Gool, Prime Object Proposals with Randomized Prim's Algorithm, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.315

D. Marr and H. K. Nishihara, Representation and Recognition of the Spatial Organization of Three-Dimensional Shapes, Proceedings of the Royal Society B: Biological Sciences, vol.200, issue.1140, pp.269-294, 1978.
DOI : 10.1098/rspb.1978.0020

M. Marszałek, I. Laptev, and C. Schmid, Actions in context, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.42-113, 2009.
DOI : 10.1109/CVPR.2009.5206557

S. Mathe and C. Sminchisescu, Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition, ECCV, pp.50-52, 2012.
DOI : 10.1007/978-3-642-33709-3_60

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.423.5629

P. Matikainen, M. Hebert, and R. Sukthankar, Representing Pairwise Spatial and Temporal Relations for Action Recognition, ECCV, 2010.
DOI : 10.1007/978-3-642-15549-9_37

P. K. Matikainen, M. Hebert, and R. Sukthankar, Trajectons: Action recognition through the motion analysis of tracked features, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp.11-12, 2009.
DOI : 10.1109/ICCVW.2009.5457659

M. Mazloom, E. Gavves, and C. Snoek, Conceptlets: Selective Semantics for Classifying Video Events, IEEE Transactions on Multimedia, vol.16, issue.8, pp.2214-2228, 2014.
DOI : 10.1109/TMM.2014.2359771

S. McCann and D. Lowe, Spatially Local Coding for Object Recognition, ACCV, 2012.
DOI : 10.1007/978-3-642-37331-2_16

M. Merler, B. Huang, L. Xie, G. Hua, and A. Natsev, Semantic Model Vectors for Complex Video Event Recognition, IEEE Transactions on Multimedia, vol.14, issue.1, pp.88-101, 2012.
DOI : 10.1109/TMM.2011.2168948

R. Messing, C. Pal, and H. Kautz, Activity recognition using the velocity histories of tracked keypoints, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459154

T. B. Moeslund and E. Granum, A Survey of Computer Vision-Based Human Motion Capture, Computer Vision and Image Understanding, vol.81, issue.3, pp.231-268, 2001.
DOI : 10.1006/cviu.2000.0897

T. B. Moeslund, A. Hilton, and V. Krüger, A survey of advances in vision-based human motion capture and analysis, CVIU, vol.104, issue.2-3, pp.90-126, 2006.

O. R. Murthy and R. Goecke, Ordered Trajectories for Large Scale Human Action Recognition, 2013 IEEE International Conference on Computer Vision Workshops, p.53, 2013.
DOI : 10.1109/ICCVW.2013.61

O. R. Murthy and R. Goecke, Combined ordered and improved trajectories for large scale human action recognition, ICCV Workshop on Action Recognition with a Large Number of Classes, pp.52-53, 2013.

G. K. Myers, R. Nallapati, J. van Hout, S. Pancoast, R. Nevatia et al., Evaluating multimedia features and fusion for example-based event detection, Machine Vision and Applications, pp.17-32, 2014.

P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis et al., Multimodal feature fusion for robust event detection in web videos, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6247814

J. Niebles, C. Chen, and L. Fei-Fei, Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, ECCV, pp.42-43, 2010.
DOI : 10.1007/978-3-642-15552-9_29

S. Niyogi and E. Adelson, Analyzing and recognizing walking figures in XYT, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition CVPR-94, 1994.
DOI : 10.1109/CVPR.1994.323868

A. Oliva and A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International Journal of Computer Vision, vol.42, issue.3, pp.145-175, 2001.
DOI : 10.1023/A:1011139631724

D. Oneata, J. Verbeek, and C. Schmid, Action and Event Recognition with Fisher Vectors on a Compact Feature Set, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.228

URL : https://hal.archives-ouvertes.fr/hal-00873662

D. Oneata, J. Revaud, J. Verbeek, and C. Schmid, Spatio-temporal Object Detection Proposals, ECCV, 2014.
DOI : 10.1007/978-3-319-10578-9_48

URL : https://hal.archives-ouvertes.fr/hal-01021902

D. Oneata, J. Verbeek, and C. Schmid, Efficient Action Localization with Approximately Normalized Fisher Vectors, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.326

URL : https://hal.archives-ouvertes.fr/hal-00979594

P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders et al., TRECVID 2012 - an overview of the goals, tasks, data, evaluation mechanisms and metrics, Proceedings of TRECVID, pp.28-44, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00953826

A. Papazoglou and V. Ferrari, Fast Object Segmentation in Unconstrained Video, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.223

S. Paris and F. Durand, A Topological Approach to Hierarchical Segmentation using Mean Shift, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.89-101, 2007.
DOI : 10.1109/CVPR.2007.383228

A. Patron-Perez, M. Marszałek, A. Zisserman, and I. Reid, High Five: Recognising human interactions in TV shows, BMVC, p.11, 2010.

O. Pele and M. Werman, Fast and robust Earth Mover's Distances, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459199

X. Peng, L. Wang, Z. Cai, Y. Qiao, and Q. Peng, Hybrid super vector with improved dense trajectories for action recognition, ICCV Workshop on Action Recognition with a Large Number of Classes, 2013.

X. Peng, Y. Qiao, and Q. Peng, Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics, ECCV, 2014.
DOI : 10.1007/978-3-319-10578-9_43

X. Peng, L. Wang, X. Wang, and Y. Qiao, Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice, Computer Vision and Image Understanding, vol.150, p.15, 2014.
DOI : 10.1016/j.cviu.2016.03.013

X. Peng, C. Zou, Y. Qiao, and Q. Peng, Action Recognition with Stacked Fisher Vectors, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_38

F. Perronnin and C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.34-64, 2007.
DOI : 10.1109/CVPR.2007.383266

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, ECCV, pp.14-62, 2010.
DOI : 10.1007/978-3-642-15561-1_11

URL : https://hal.archives-ouvertes.fr/inria-00548630

P. J. Phillips and A. J. O'Toole, Comparison of human and computer performance across face recognition experiments, Image and Vision Computing, vol.32, issue.1, pp.74-85, 2014.
DOI : 10.1016/j.imavis.2013.12.002

R. Poppe, A survey on vision-based human action recognition, Image and Vision Computing, vol.28, issue.6, pp.976-990, 2010.
DOI : 10.1016/j.imavis.2009.11.014

A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, Learning object class detectors from weakly annotated video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, p.100, 2012.
DOI : 10.1109/CVPR.2012.6248065

URL : https://hal.archives-ouvertes.fr/hal-00695940

A. Prest, C. Schmid, and V. Ferrari, Weakly Supervised Learning of Interactions between Humans and Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.3, pp.601-614, 2012.
DOI : 10.1109/TPAMI.2011.158

URL : https://hal.archives-ouvertes.fr/inria-00516477

A. Prest, V. Ferrari, and C. Schmid, Explicit Modeling of Human-Object Interactions in Realistic Videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.4, pp.835-848, 2013.
DOI : 10.1109/TPAMI.2012.175

URL : https://hal.archives-ouvertes.fr/hal-00720847

L. Rabiner and B. Juang, Fundamentals of Speech Recognition, 1993.

L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, pp.257-286, 1989.

C. Rao, A. Yilmaz, and M. Shah, View-invariant representation and recognition of actions, International Journal of Computer Vision, vol.50, issue.2, pp.203-226, 2002.
DOI : 10.1023/A:1020350100748

K. Reddy and M. Shah, Recognizing 50 human action categories of web videos. Machine Vision and Applications, pp.971-981, 2013.

L. G. Roberts, Machine perception of three-dimensional solids, 1963.

M. Rodriguez, J. Ahmed, and M. Shah, Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587727

K. Rohr, Towards model-based recognition of human movements in image sequences, CVGIP: Image Understanding, vol.59, issue.1, pp.94-115, 1994.

M. Rohrbach, Q. Wei, I. Titov, S. Thater, M. Pinkal et al., Translating Video Content to Natural Language Descriptions, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.61

C. Rother, V. Kolmogorov, and A. Blake, "GrabCut", ACM Transactions on Graphics, vol.23, issue.3, pp.309-314, 2004.
DOI : 10.1145/1015706.1015720

S. Sadanand and J. J. Corso, Action bank: A high-level representation of activity in video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6247806

H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.26, issue.1, pp.43-49, 1978.
DOI : 10.1109/TASSP.1978.1163055

J. Sánchez, F. Perronnin, and T. De-campos, Modeling the spatial layout of images beyond spatial pyramids, Pattern Recognition Letters, vol.33, issue.16, pp.2216-2223, 2012.
DOI : 10.1016/j.patrec.2012.07.019

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image Classification with the Fisher Vector: Theory and Practice, International Journal of Computer Vision, vol.105, issue.3, pp.222-245, 2013.
DOI : 10.1007/s11263-013-0636-x

P. Sand and S. Teller, Particle Video: Long-Range Motion Estimation Using Point Trajectories, International Journal of Computer Vision, vol.80, issue.1, pp.72-91, 2008.
DOI : 10.1007/s11263-008-0136-6

M. Sapienza, F. Cuzzolin, and P. Torr, Learning discriminative space-time actions from weakly labelled videos, BMVC, 2012.

S. Satkin and M. Hebert, Modeling the Temporal Extent of Actions, ECCV, 2010.
DOI : 10.1007/978-3-642-15549-9_39

C. Schmid, Constructing models for content-based image retrieval, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 2001.
DOI : 10.1109/CVPR.2001.990922

URL : https://hal.archives-ouvertes.fr/inria-00548274

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, ICPR, 2004.

P. Scovanner, S. Ali, and M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, Proceedings of the 15th international conference on Multimedia, MULTIMEDIA '07, 2007.
DOI : 10.1145/1291233.1291311

T. Serre, L. Wolf, and T. Poggio, Object Recognition with Features Inspired by Visual Cortex, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.254

E. Shechtman and M. Irani, Matching Local Self-Similarities across Images and Videos, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383198

Z. Shi, T. Hospedales, and T. Xiang, Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.371

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

K. Simonyan, O. Parkhi, A. Vedaldi, and A. Zisserman, Fisher Vector Faces in the Wild, Proceedings of the British Machine Vision Conference 2013, p.34, 2013.
DOI : 10.5244/C.27.8

K. Simonyan, A. Vedaldi, and A. Zisserman, Deep Fisher networks for large-scale image classification, NIPS, 2013.

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, 2003.
DOI : 10.1109/ICCV.2003.1238663

Y. Song, L. Goncalves, and P. Perona, Unsupervised learning of human motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.25, issue.7, pp.814-827, 2003.
DOI : 10.1109/TPAMI.2003.1206511

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, pp.43-53, 2012.

T. E. Starner and A. Pentland, Visual recognition of American Sign Language using hidden Markov models, International Symposium on Computer Vision, 1995.

C. Sun and R. Nevatia, ACTIVE: Activity Concept Transitions in Video Event Classification, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.453

J. Sun, X. Wu, S. Yan, L. Cheong, T. Chua et al., Hierarchical spatio-temporal context modeling for action recognition, CVPR, 2009.

N. Sundaram, T. Brox, and K. Keutzer, Dense point trajectories by GPU-accelerated large displacement optical flow, ECCV, 2010.

V. Sydorov, M. Sakurada, and C. Lampert, Deep Fisher Kernels -- End to End Learning of the Fisher Kernel GMM Parameters, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.182

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.220

K. Tang, L. Fei-fei, and D. Koller, Learning latent temporal structure for complex event detection, 2012 IEEE Conference on Computer Vision and Pattern Recognition, p.44
DOI : 10.1109/CVPR.2012.6247808

K. Tang, B. Yao, L. Fei-fei, and D. Koller, Combining the Right Features for Complex Event Recognition, 2013 IEEE International Conference on Computer Vision, p.19, 2013.
DOI : 10.1109/ICCV.2013.335

M. Tao, J. Bai, P. Kohli, and S. Paris, SimpleFlow: A Non-iterative, Sublinear Optical Flow Algorithm, Computer Graphics Forum, 2012.
DOI : 10.1111/j.1467-8659.2012.03013.x

E. Taralova, F. De-la-torre, and M. Hebert, Motion Words for Videos, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_47

G. Taylor, R. Fergus, Y. Lecun, and C. Bregler, Convolutional Learning of Spatio-temporal Features, ECCV, 2010.
DOI : 10.1007/978-3-642-15567-3_11

Y. Tian, R. Sukthankar, and M. Shah, Spatio-temporal deformable part models for action detection, CVPR, pp.22-23, 2013.

D. Tran and J. Yuan, Optimal spatio-temporal path discovery for video event detection, CVPR 2011, p.22, 2011.
DOI : 10.1109/CVPR.2011.5995416

D. Tran and J. Yuan, Max-margin structured output regression for spatiotemporal action localization, NIPS, pp.350-358, 2012.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, C3D: Generic features for video analysis. ArXiv e-prints, 2014.

H. Uemura, S. Ishikawa, and K. Mikolajczyk, Feature Tracking and Motion Compensation for Action Recognition, Proceedings of the British Machine Vision Conference 2008, 2008.
DOI : 10.5244/C.22.30

J. Uijlings, K. Van-de-sande, T. Gevers, and A. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.104, issue.2, pp.154-171, 2013.
DOI : 10.1007/s11263-013-0620-5

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.361.3382

A. Vahdat and G. Mori, Handling Uncertain Tags in Visual Recognition, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.462

A. Vahdat, K. Cannons, G. Mori, S. Oh, and I. Kim, Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.463

K. Van-de-sande, T. Gevers, and C. Snoek, Evaluating Color Descriptors for Object and Scene Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, pp.1582-1596, 2010.
DOI : 10.1109/TPAMI.2009.154

K. Van-de-sande, J. Uijlings, T. Gevers, and A. Smeulders, Segmentation as selective search for object recognition, 2011 International Conference on Computer Vision, p.25, 2011.
DOI : 10.1109/ICCV.2011.6126456

K. Van-de-sande, C. Snoek, and A. Smeulders, Fisher and VLAD with FLAIR, CVPR, pp.88-90, 2014.

M. Van-den-bergh, G. Roig, X. Boix, S. Manen, and L. Van-gool, Online video SEEDS for temporal window objectness, ICCV, pp.89-91, 2013.

L. Van-der-maaten, Learning discriminative Fisher kernels, ICML, 2011.

M. Varma and A. Zisserman, Classifying Images of Materials: Achieving Viewpoint and Illumination Independence, ICCV, 2002.
DOI : 10.1007/3-540-47977-5_17

A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, Multiple kernels for object detection, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459183

P. Viola and M. Jones, Robust Real-Time Face Detection, International Journal of Computer Vision, vol.57, issue.2, pp.137-154, 2004.
DOI : 10.1023/B:VISI.0000013087.49260.fb

F. Wang, Z. Sun, D. Zhang, and C. Ngo, Semantic indexing and multimedia event detection: ECNU at TRECVID 2012, TRECVID Workshop, 2012.

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, pp.12-31, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

H. Wang, M. Ullah, A. Kläser, I. Laptev, and C. Schmid, Evaluation of local spatio-temporal features for action recognition, Proceedings of the British Machine Vision Conference 2009, 2009.
DOI : 10.5244/C.23.124

URL : https://hal.archives-ouvertes.fr/inria-00439769

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense Trajectories and Motion Boundary Descriptors for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, pp.60-79, 2013.
DOI : 10.1007/s11263-012-0594-8

URL : https://hal.archives-ouvertes.fr/hal-00725627

J. Wang, J. Yang, K. Yu, F. Lv, T. Huang et al., Locality-constrained Linear Coding for image classification, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5540018

L. Wang, Y. Qiao, and X. Tang, Mining Motion Atoms and Phrases for Complex Action Recognition, 2013 IEEE International Conference on Computer Vision, pp.2680-2687, 2013.
DOI : 10.1109/ICCV.2013.333

L. Wang, Y. Qiao, and X. Tang, Latent Hierarchical Model of Temporal Structure for Complex Activity Classification, IEEE Transactions on Image Processing, vol.23, issue.2, pp.810-822, 2014.
DOI : 10.1109/TIP.2013.2295753

L. Wang, Y. Qiao, and X. Tang, Video Action Detection with Relational Dynamic-Poselets, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_37

X. Wang, L. Wang, and Y. Qiao, A Comparative Study of Encoding, Pooling and Normalization Methods for Action Recognition, ACCV, 2012.
DOI : 10.1007/978-3-642-37431-9_44

X. Wang, M. Yang, S. Zhu, and Y. Lin, Regionlets for generic object detection, ICCV, 2013.
DOI : 10.1109/tpami.2015.2389830

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.407.4464

O. Weber, Y. Devir, A. Bronstein, M. Bronstein, and R. Kimmel, Parallel algorithms for approximation of distance maps on parametric surfaces, ACM Transactions on Graphics, vol.27, issue.4, 2008.
DOI : 10.1145/1409625.1409626

D. Weinland, R. Ronfard, and E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, Computer Vision and Image Understanding, vol.115, issue.2, pp.224-241, 2011.
DOI : 10.1016/j.cviu.2010.10.002

URL : https://hal.archives-ouvertes.fr/inria-00459653

G. Willems, T. Tuytelaars, and L. Van-gool, An efficient dense and scale-invariant spatio-temporal interest point detector, ECCV, 2008.

G. Willems, J. Becker, T. Tuytelaars, and L. Van-gool, Exemplar-based Action Recognition in Video, Proceedings of the British Machine Vision Conference 2009, 2009.
DOI : 10.5244/C.23.90

A. Wilson and A. Bobick, Learning visual behavior for gesture analysis, Proceedings of International Symposium on Computer Vision, ISCV, 1995.
DOI : 10.1109/ISCV.1995.477006

J. Winn, A. Criminisi, and T. Minka, Object categorization by learned universal visual dictionary, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, p.66
DOI : 10.1109/ICCV.2005.171

J. Wu, Y. Zhang, and W. Lin, Towards Good Practices for Action Video Encoding, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.330

S. Wu, O. Oreifej, and M. Shah, Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126397

C. Xu and J. Corso, Evaluation of super-voxel methods for early video processing, CVPR, 2012.

C. Xu, C. Xiong, and J. Corso, Streaming Hierarchical Video Segmentation, ECCV, pp.89-114, 2012.
DOI : 10.1007/978-3-642-33783-3_45

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.298.7791

C. Xu, S. Whitt, and J. Corso, Flattening Supervoxel Hierarchies by the Uniform Entropy Slice, 2013 IEEE International Conference on Computer Vision, pp.90-115, 2013.
DOI : 10.1109/ICCV.2013.279

J. Yamato, J. Ohya, and K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, CVPR, 1992.

J. Yang, K. Yu, Y. Gong, and T. Huang, Linear spatial pyramid matching using sparse coding for image classification, CVPR, 2009.

X. Yang and Y. Tian, Action Recognition Using Super Sparse Coding Vector with Spatio-temporal Awareness, ECCV, pp.16-21, 2014.
DOI : 10.1007/978-3-319-10605-2_47

L. Yeffet and L. Wolf, Local Trinary Patterns for human action recognition, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459201

G. Yu, J. Yuan, and Z. Liu, Propagative Hough voting for human activity recognition, ECCV, 2012.

J. Yuan, Z. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, CVPR, pp.22-90, 2009.

J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals et al., Beyond short snippets: Deep networks for video classification. ArXiv e-prints, 2015.

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_53

D. Zhang, O. Javed, and M. Shah, Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.87

J. Zheng, Z. Jiang, R. Chellappa, and P. J. Phillips, Submodular attribute selection for action recognition in video, NIPS, 2014.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Object detectors emerge in deep scene CNNs. ArXiv e-prints, 2014.

X. Zhou, K. Yu, T. Zhang, and T. S. Huang, Image Classification Using Super-Vector Coding of Local Image Descriptors, ECCV, 2010.
DOI : 10.1007/978-3-642-15555-0_11

J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, Action Recognition with Actons, 2013 IEEE International Conference on Computer Vision, p.50, 2013.
DOI : 10.1109/ICCV.2013.442

C. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, ECCV, pp.25-88, 2014.
DOI : 10.1007/978-3-319-10602-1_26