M. H. Bornstein, K. Ferdinandsen, and C. G. Gross, Perception of symmetry in infancy, Developmental Psychology, vol.17, issue.1, pp.82-86, 1981.

Y. Boykov and M. Jolly, Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images, ICCV, p.36, 2001.

Y. Boykov, O. Veksler, and R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.23, issue.11, pp.1222-1239, 2001.

W. Brendel and S. Todorovic, Video object segmentation by tracking regions, ICCV, p.61, 2009.

T. Brox and J. Malik, Object segmentation by long term analysis of point trajectories, ECCV, vol.62, p.118, 2010.

T. Brox and J. Malik, Large displacement optical flow: Descriptor matching in variational motion estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, issue.3, p.86, 2011.

S. Caelles, K. Maninis, J. Pont-tuset, L. Leal-taixé, D. Cremers, and L. Van-gool, One-shot video object segmentation, CVPR, vol.52, p.60, 2017.

J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, Semantic segmentation with second-order pooling, ECCV, 2012.

J. Carreira and C. Sminchisescu, CPMC: Automatic object segmentation using constrained parametric min-cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.7, p.18, 2012.

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, ICLR, vol.46, p.110, 2015.

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.71, 2017.

M. Cheng, N. J. Mitra, X. Huang, P. Torr, and S. Hu, Global contrast based salient region detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

K. Cho, B. Van-merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, EMNLP, vol.65, p.66
URL : https://hal.archives-ouvertes.fr/hal-01433235

R. G. Cinbis, J. Verbeek, and C. Schmid, Multi-fold MIL training for weakly supervised object localization, CVPR
URL : https://hal.archives-ouvertes.fr/hal-00975746

J. Dai, K. He, and J. Sun, Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation, ICCV, 2015.

A. Dave, O. Russakovsky, and D. Ramanan, Predictive-corrective networks for action detection, CVPR, 2017.

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (Methodological), vol.39, issue.1, pp.1-38, 1977.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, p.67, 2015.

A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş et al., FlowNet: Learning optical flow with convolutional networks, ICCV, vol.69, p.73, 2015.

P. Duygulu, K. Barnard, J. F. De-freitas, and D. A. Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, ECCV, p.25, 2002.

M. Everingham, L. Van-gool, C. K. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, vol.38, p.106

A. Faktor and M. Irani, Video segmentation by non-local consensus voting, BMVC, vol.94, p.96

R. L. Fantz, J. Fagan, and S. B. Miranda, Early visual selectivity. Infant perception: From sensation to cognition, vol.1, pp.9-12

C. Farabet, C. Couprie, L. Najman, and Y. Lecun, Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.3, issue.8, p.31, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00742077

C. Finn, I. Goodfellow, and S. Levine, Unsupervised learning for physical interaction through video prediction, NIPS, 2016.

M. A. Fischler and R. C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM, vol.24, issue.6, p.55, 1981.

K. Fukushima, Cognitron: A self-organizing multilayered neural network, Biological cybernetics, vol.20, issue.3-4, pp.121-136, 1975.

R. Gadde, V. Jampani, and P. V. Gehler, Semantic video CNNs through representation warping, ICCV, 2017.

F. Galasso, M. Keuper, T. Brox, and B. Schiele, Spectral graph reduction for efficient image and streaming video segmentation, CVPR, p.61, 2014.

F. Galasso, N. S. Nagaraja, T. J. Cardenas, T. Brox, and B. Schiele, A unified video segmentation benchmark: Annotation, metrics and analysis, ICCV, 2013.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS

A. Graves, Generating sequences with recurrent neural networks, p.106, 2013.

A. Graves, N. Jaitly, and A. Mohamed, Hybrid speech recognition with deep bidirectional LSTM, Workshop on Automatic Speech Recognition and Understanding, 2013.
DOI : 10.1109/asru.2013.6707742

A. Graves, A. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, ICASSP, 2013.
DOI : 10.1109/icassp.2013.6638947

A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks, vol.18, issue.5, p.78, 2005.
DOI : 10.1016/j.neunet.2005.06.042

M. Grundmann, V. Kwatra, M. Han, and I. Essa, Efficient hierarchical graph based video segmentation, CVPR, p.61, 2010.
DOI : 10.1109/cvpr.2010.5539893
URL : http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36247.pdf

U. Güçlü and M. A. Van-gerven, Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream, Journal of Neuroscience, vol.35, issue.27, 2015.

B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, Semantic contours from inverse detectors, ICCV, vol.39, p.44, 2011.
DOI : 10.1109/iccv.2011.6126343
URL : http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/papers/habmm_iccv2011.pdf

G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra et al., Weakly supervised learning of object segmentations from web-scale video, ECCV, 2012.
DOI : 10.1007/978-3-642-33863-2_20
URL : http://www.cs.cmu.edu/~rahuls/pub/eccv2012wk-cp-rahuls.pdf

K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, ICCV, vol.116, p.118, 2017.
DOI : 10.1109/tpami.2018.2844175

K. He, X. Zhang, S. Ren, and J. Sun, Identity mappings in deep residual networks, ECCV, vol.1, p.109, 2016.

S. Hochreiter, Untersuchungen zu dynamischen neuronalen Netzen, Diploma thesis, 1991.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol.9, issue.8, pp.1735-1780, 1997.

S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han, Weakly supervised semantic segmentation using web-crawled videos, CVPR, vol.112, p.115, 2017.

J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. National Academy of Sciences, vol.79, issue.8, pp.2554-2558, 1982.

B. Horn, Robot vision, p.54, 1986.

J. F. Hughes and J. D. Foley, Computer graphics: principles and practice, p.57, 2014.

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy et al., Flownet 2.0: Evolution of optical flow estimation with deep networks, CVPR, vol.74, p.86, 2017.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, ICML, 2015.

S. D. Jain, B. Xiong, and K. Grauman, Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos, CVPR, vol.58, p.95, 2017.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe: Convolutional architecture for fast feature embedding, ACM Multimedia

B. Jin, M. V. Ortiz-segovia, and S. Susstrunk, Webly supervised semantic segmentation, CVPR, 2017.

T. Joachims, Transductive inference for text classification using support vector machines, ICML, 1999.

A. Joulin, K. Tang, and L. Fei-fei, Efficient image and video colocalization with Frank-Wolfe algorithm, ECCV, vol.47, p.48

M. Keuper, B. Andres, and T. Brox, Motion trajectory segmentation via minimum cost multicuts, ICCV, vol.63, p.118, 2015.

A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele, Simple does it: Weakly supervised instance and semantic segmentation, CVPR, vol.23, p.24, 2017.

A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele, Lucid data dreaming for object tracking, The 2017 DAVIS Challenge on Video Object Segmentation-CVPR Workshops, 2017.

A. Khoreva, F. Galasso, M. Hein, and B. Schiele, Classifier based graph construction for video segmentation, CVPR, vol.61, p.62, 2015.

A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-hornung, Learning video object segmentation from static images, CVPR, vol.60, p.95, 2017.

K. Koffka, Principles of Gestalt psychology, Brace Jovanovich, p.62, 1935.

Y. J. Koh and C. Kim, Primary object segmentation in videos based on region augmentation and reduction, CVPR, vol.92, p.93, 2017.

A. Kolesnikov and C. H. Lampert, Seed, expand and constrain: Three principles for weakly-supervised image segmentation, ECCV, vol.104, p.110, 2016.

P. Krähenbühl and V. Koltun, Efficient inference in fully connected CRFs with Gaussian edge potentials, NIPS, vol.23, p.84, 2011.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.

A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury et al., Ask me anything: Dynamic memory networks for natural language processing, ICML, 2016.

S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid, Unsupervised object discovery and tracking in video collections, ICCV, vol.32, p.48, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01153017

T. Le, K. Nguyen, M. Nguyen-phan, T. Ton, T. N. et al., Instance re-identification flow for video object segmentation, The 2017 DAVIS Challenge on Video Object Segmentation-CVPR Workshops, 2017.

Y. Lecun and Y. Bengio, Convolutional networks for images, speech, and time series, vol.3361, p.65, 1995.

Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard et al., Backpropagation applied to handwritten zip code recognition, Neural computation, vol.1, issue.4, p.18, 1989.

Y. J. Lee, J. Kim, and K. Grauman, Key-segments for video object segmentation, ICCV, 2011.

I. Lenz, H. Lee, and A. Saxena, Deep learning for detecting robotic grasps, The International Journal of Robotics Research, vol.3, issue.4-5, p.18, 2015.

J. Lezama, K. Alahari, J. Sivic, and I. Laptev, Track to the future: Spatio-temporal video segmentation with long-range motion cues, CVPR, p.61, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00817961

F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, Video segmentation by tracking many figure-ground segments, ICCV

G. Li and Y. Yu, Deep contrast learning for salient object detection, CVPR, 2016.

X. Li, Y. Qi, Z. Wang, K. Chen, Z. Liu et al., Video object segmentation with re-identification, The 2017 DAVIS Challenge on Video Object Segmentation-CVPR Workshops, p.119, 2017.

X. Li, L. Zhao, L. Wei, M. Yang, F. Wu et al., Deepsaliency: Multi-task deep neural network model for salient object detection, CVPR, 2016.

Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, Fully convolutional instance-aware semantic segmentation, CVPR, vol.116, p.118, 2017.

X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin et al., Towards computational baby learning: A weakly-supervised approach for object detection, ICCV

G. Lin, A. Milan, C. Shen, and I. Reid, Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation, CVPR, 2017.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common objects in context, ECCV, p.71

P. O. Pinheiro and R. Collobert, From image-level to pixel-level labeling with convolutional networks, CVPR, vol.32, p.46, 2015.

P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár, Learning to refine object segments, ECCV, vol.52, p.75, 2016.

A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, Learning object class detectors from weakly annotated video, CVPR, vol.38, p.105, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00695940

M. Ranzato and M. Szummer, Semi-supervised learning of compact document representations with deep networks, ICML, 2008.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, p.52, 2015.

X. Ren and J. Malik, Tracking as repeated figure/ground segmentation, CVPR

J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, EpicFlow: Edge-preserving interpolation of correspondences for optical flow, CVPR, vol.74, p.86, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01142656

B. Romera-paredes and P. H. Torr, Recurrent instance segmentation, ECCV, Springer, 2016.

O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, MICCAI, vol.18, p.69, 2015.

C. Rother, V. Kolmogorov, and A. Blake, Grabcut: Interactive foreground extraction using iterated graph cuts, ACM Trans. Graphics, vol.23, issue.3, p.37, 2004.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.

O. Russakovsky, Y. Lin, K. Yu, and L. Fei-fei, Object-centric spatial pooling for image classification, ECCV, 2012.

X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong et al., Convolutional LSTM network: A machine learning approach for precipitation nowcasting, NIPS, 2015.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre et al., Mastering the game of Go with deep neural networks and tree search, Nature, vol.529, issue.7587, pp.484-489, 2016.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, vol.71, p.84

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

E. S. Spelke, Principles of object perception, Cognitive science, vol.14, issue.1, p.116, 1990.

N. Srivastava, E. Mansimov, and R. Salakhutdinov, Unsupervised learning of video representations using LSTMs, ICML, 2015.

S. Sukhbaatar, J. Weston, and R. Fergus, End-to-end memory networks, NIPS, 2015.

N. Sundaram, T. Brox, and K. Keutzer, Dense point trajectories by GPU-accelerated large displacement optical flow, ECCV, vol.62, p.86, 2010.

B. Taylor, V. Karasev, and S. Soatto, Causal video object segmentation from persistence of occlusions, CVPR, vol.57, p.80, 2015.

T. Tieleman and G. Hinton, Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning, p.83, 2012.

P. Tokmakov, K. Alahari, and C. Schmid, Weakly-supervised semantic segmentation using motion cues, ECCV, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01292794

P. Tokmakov, K. Alahari, and C. Schmid, Learning motion patterns in videos, CVPR, vol.56, p.94, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01427480

P. Tokmakov, K. Alahari, and C. Schmid, Learning video object segmentation with visual memory, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01511145

P. H. Torr, Geometric motion segmentation and model selection, Phil. Trans. Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol.54, p.55

Y. Tsai, M. Yang, and M. J. Black, Video segmentation via object flow, CVPR, vol.59, p.60, 2016.

S. Valipour, M. Siam, M. Jagersand, and N. Ray, Recurrent fully convolutional networks for video segmentation

G. Van-horn and P. Perona, The devil is in the tails: Fine-grained classification in the wild, p.117, 2017.

A. Vezhnevets, V. Ferrari, and J. Buhmann, Weakly supervised structured output learning for semantic segmentation, CVPR, p.16, 2012.

P. Voigtlaender and B. Leibe, Online adaptation of convolutional neural networks for video object segmentation, BMVC, 2017.

J. Wang, P. Bhat, R. A. Colburn, M. Agrawala, and M. F. Cohen, Interactive video cutout, ACM Transactions on Graphics (ToG), vol.24, issue.3, pp.585-594, 2005.

W. Wang, J. Shen, and F. Porikli, Saliency-aware geodesic video object segmentation, CVPR, vol.63, p.64, 2015.

Y. Wang, D. Ramanan, and M. Hebert, Learning to model the tail, Advances in Neural Information Processing Systems, p.117, 2017.

L. Wen, D. Du, Z. Lei, S. Z. Li, and M. Yang, Jots: Joint online tracking and segmentation, CVPR, 2015.

P. J. Werbos, Backpropagation through time: What it does and how to do it, Proc. IEEE, vol.78, p.79

J. Wu, Y. Zhao, J. Zhu, S. Luo, and Z. Tu, MILCut: A sweeping line multiple instance learning paradigm for interactive image segmentation, CVPR

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi et al., Google's neural machine translation system: Bridging the gap between human and machine translation, 2016.

W. Xia, C. Domokos, J. Dong, L. Cheong, and S. Yan, Semantic segmentation without annotating segments, ICCV, vol.5, p.23, 2013.

C. Xu and J. J. Corso, LIBSVX: A supervoxel library and benchmark for early video processing, International Journal of Computer Vision, p.61, 2016.

J. Xu, A. G. Schwing, and R. Urtasun, Tell me what you see and I will show you where it is, CVPR, p.25, 2014.

D. L. Yamins and J. J. Dicarlo, Using goal-driven deep learning models to understand sensory cortex, Nature neuroscience, vol.19, issue.3, pp.356-365, 2016.

D. Zhang, O. Javed, and M. Shah, Video object segmentation through spatially accurate and temporally dense extraction of primary object regions, CVPR, 2013.

Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia, Semantic object segmentation via detection in weakly labeled video, CVPR, 2015.

H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba, Open vocabulary scene parsing, ICCV, 2017.

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, Pyramid scene parsing network, CVPR, 2017.

S. Zheng, S. Jayasumana, B. Romera-paredes, V. Vineet, Z. Su et al., Conditional random fields as recurrent neural networks, ICCV, vol.31, p.46, 2015.

B. Zhou, D. Bau, A. Oliva, and A. Torralba, Interpreting deep visual representations via network dissection, CVPR, 2017.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, CVPR, vol.48, p.115, 2016.

J. Zhu, J. Mao, and A. L. Yuille, Learning from weakly supervised data by the expectation loss svm (e-svm) algorithm, NIPS, vol.19, p.23

X. Zhu, D. Anguelov, and D. Ramanan, Capturing long-tail distributions of object subcategories, CVPR