To build this dataset, we queried Uniprot [MC11] for capsids and nucleocapsids (a type of capsid that is directly bound to the genetic material). We then applied a taxonomic filter to keep only viruses (removing, for example, sequences coding for proteins that bind to capsids), and kept only the Uniprot identifiers associated with PDB structures. After reducing the redundancy (90%) of the PDB sequences, we obtained a dataset of 327 capsid chains.
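The filtering steps described above can be sketched as follows. This is only an illustrative outline, not the pipeline actually used: the `Entry` record, the `identity` function (a toy position-wise comparison without alignment) and the greedy reduction are assumptions, and reading "90%" as a maximum pairwise sequence identity is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical, simplified view of one protein record."""
    accession: str
    lineage: list                 # taxonomic lineage, root first
    pdb_ids: list = field(default_factory=list)
    sequence: str = ""


def identity(a: str, b: str) -> float:
    """Toy identity: matching positions over the shorter length (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def build_dataset(entries, max_identity=0.90):
    # Taxonomic filter: keep viral entries only.
    viral = [e for e in entries if "Viruses" in e.lineage]
    # Keep only entries associated with at least one PDB structure.
    with_structure = [e for e in viral if e.pdb_ids]
    # Greedy redundancy reduction (assumed strategy): visit longest sequences
    # first and keep one only if it is at most max_identity identical to
    # every representative kept so far.
    kept = []
    for e in sorted(with_structure, key=lambda e: -len(e.sequence)):
        if all(identity(e.sequence, k.sequence) <= max_identity for k in kept):
            kept.append(e)
    return kept


# Made-up records, for illustration only.
entries = [
    Entry("A1", ["Viruses", "Caudovirales"], ["1ABC"], "MKTAYIAKQRQISFVKSHFS"),
    Entry("A2", ["Viruses", "Caudovirales"], ["2DEF"], "MKTAYIAKQRQISFVKSHFA"),
    Entry("A3", ["Bacteria"], ["3GHI"], "MSLNNWQRTALDDGE"),
    Entry("A4", ["Viruses"], [], "MADEEKLPPGWEKRMS"),
    Entry("A5", ["Viruses"], ["4JKL"], "GSHMLEDPVQRWTNAC"),
]
selected = [e.accession for e in build_dataset(entries)]
print(selected)
```

In practice, redundancy reduction is usually delegated to a dedicated clustering tool working on aligned sequences; the loop above only illustrates the principle.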

The Marine Phage, Virus, and Virome Sequencing Project (Gordon and Betty Moore Foundation) led to the publication of marine phage genomes in the iMicrobe database (http://www.imicrobe.us). The code associated with this project in iMicrobe is CAM_PROJ_BroadPhageGenomes. It contains 20343 protein sequences, of which 1172 have no annotation.

Bibliography

R. Apweiler, A. Bairoch, and C. H. Wu, Protein sequence databases, Current Opinion in Chemical Biology, vol.8, issue.1, pp.76-80, 2004.

S. F. Altschul, W. Gish, W. Miller, E. W. Myers et al., Basic local alignment search tool, Journal of Molecular Biology, vol.215, issue.3, pp.403-410, 1990.

A. Andreeva, D. Howorth, C. Chothia, E. Kulesha, and A. G. Murzin, SCOP2 prototype: a new approach to protein structure mining, Nucleic Acids Research, vol.42, issue.D1, pp.310-314, 2014.
DOI : 10.1093/nar/gkt1242

R. Andonov, N. Malod-Dognin, and N. Yanev, Maximum Contact Map Overlap Revisited, Journal of Computational Biology, vol.18, issue.1, pp.27-41, 2011.
DOI : 10.1089/cmb.2009.0196
URL : https://hal.archives-ouvertes.fr/inria-00536624

S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, vol.25, issue.17, pp.3389-3402, 1997.

C. B. Anfinsen, Studies on the principles that govern the folding of protein chains, 1972.

A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller et al., The Pfam protein families database, Nucleic Acids Research, vol.30, issue.1, pp.276-280, 2002.
URL : https://hal.archives-ouvertes.fr/hal-01294685

M. Boccara, M. Carpentier, J. Chomilier, F. Coste, C. Galiez, J. Pothier, and A. Veluchamy, Identifying distant homologous viral sequences in metagenomes using protein structure information, ECCB'14 Workshop on Recent Computational Advances in Metagenomics, 2014.

A. Bar-Even, E. Noor, Y. Savir, W. Liebermeister, D. Davidi, D. S. Tawfik, and R. Milo, The moderately efficient enzyme: evolutionary and physicochemical trends shaping enzyme parameters, Biochemistry, vol.50, issue.21, pp.4402-4410, 2011.

F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. F. Meyer, M. D. Brice et al., The Protein Data Bank: a computer-based archival file for macromolecular structures, Journal of Molecular Biology, vol.112, issue.3, pp.535-542, 1977.

A. Bhaduri, R. Ravishankar, and R. Sowdhamini, Conserved spatially interacting motifs of protein superfamilies: Application to fold recognition and function annotation of genome data, Proteins: Structure, Function, and Bioinformatics, vol.54, issue.4, pp.657-670, 2004.
DOI : 10.1002/prot.10638

I. Budowski-Tal, Y. Nov, and R. Kolodny, FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately, Proceedings of the National Academy of Sciences, vol.107, pp.3481-3486, 2010.
DOI : 10.1073/pnas.0914097107

M. Carpentier, S. Brouillet, and J. Pothier, YAKUSA: A fast structural database scanning method, Proteins: Structure, Function, and Bioinformatics, vol.61, issue.1, pp.137-151, 2005.
DOI : 10.1002/prot.20517

C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos et al., BLAST+: architecture and applications, BMC Bioinformatics, vol.10, issue.1, p.421, 2009.
DOI : 10.1186/1471-2105-10-421

Y.-S. Chiang, T. I. Gelfand, A. E. Kister, and I. M. Gelfand, New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage, Proteins: Structure, Function, and Bioinformatics, vol.68, issue.4, pp.915-921, 2007.

A. Camproux, R. Gautier, and P. Tufféry, A Hidden Markov Model Derived Structural Alphabet for Proteins, Journal of Molecular Biology, vol.339, issue.3, pp.591-605, 2004.
DOI : 10.1016/j.jmb.2004.04.005

A. I. Culley, A. S. Lang, and C. A. Suttle, Metagenomic Analysis of Coastal RNA Virus Communities, Science, vol.312, issue.5781, pp.1795-1798, 2006.
DOI : 10.1126/science.1127404

P. Ciaccia, M. Patella, and P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB '97, pp.426-435, 1997.

S. Y. Chung and S. Subbiah, A structural explanation for the twilight zone of protein sequence homology, Structure, vol.4, issue.10, pp.1123-1127, 1996.

N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, 2000.
DOI : 10.1017/CBO9780511801389

J. Cooley and J. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, vol.19, issue.90, pp.297-301, 1965.
DOI : 10.1090/S0025-5718-1965-0178586-1

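D. L. Davies and D. W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.PAMI-1, issue.2, pp.224-227, 1979.
DOI : 10.1109/TPAMI.1979.4766909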

A. G. de Brevern, C. Etchebest, and S. Hazout, Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks, Proteins: Structure, Function, and Genetics, vol.41, issue.3, pp.271-287, 2000.
DOI : 10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z
URL : https://hal.archives-ouvertes.fr/inserm-00132821

L. Dib and A. Carbone, Protein Fragments: Functional and Structural Roles of Their Coevolution Networks, PLoS ONE, vol.7, issue.11, p.e48124, 2012.
DOI : 10.1371/journal.pone.0048124

J. Dunbar, K. Krawczyk, J. Leem, T. Baker, A. Fuchs et al., SAbDab: the structural antibody database, Nucleic Acids Research, vol.42, issue.D1, pp.1140-1146, 2014.
DOI : 10.1093/nar/gkt1043

P. Di Lena, K. Nagata, and P. Baldi, Deep architectures for protein contact map prediction, Bioinformatics, vol.28, issue.19, pp.2449-2457, 2012.

A. K. Dunker, C. J. Oldfield, J. Meng, P. Romero, J. Y. Yang et al., The unfoldomics decade: an update on intrinsically disordered proteins, BMC Genomics, vol.9, issue.Suppl 2, p.S1, 2008.

J. W. Drake, The distribution of rates of spontaneous mutation over viruses, prokaryotes, and eukaryotes, Annals of the New York Academy of Sciences, vol.870, issue.1, pp.100-107, 1999.

J. M. Duarte, R. Sathyapriya, H. Stehr, I. Filippis, and M. Lappe, Optimal contact definition for reconstruction of contact maps, BMC Bioinformatics, vol.11, issue.1, p.283, 2010.

C. Etchebest, C. Benros, S. Hazout, and A. G. de Brevern, A structural alphabet for local protein structures: Improved prediction methods, Proteins: Structure, Function, and Bioinformatics, vol.59, issue.4, pp.810-827, 2005.
DOI : 10.1002/prot.20458
URL : https://hal.archives-ouvertes.fr/inserm-00143564

S. Eddy, Profile hidden Markov models, Bioinformatics, vol.14, issue.9, pp.755-763, 1998.

R. C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research, vol.32, issue.5, pp.1792-1797, 2004.
DOI : 10.1093/nar/gkh340
URL : https://hal.archives-ouvertes.fr/hal-00897814

I. Elias, Settling the intractability of multiple alignment, Lecture Notes in Computer Science, vol.2906, pp.352-363, 2003.

R. A. Edwards and F. Rohwer, Viral metagenomics, Nature Reviews Microbiology, vol.3, issue.6, pp.504-510, 2005.

Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, An efficient boosting algorithm for combining preferences, J. Mach. Learn. Res, vol.4, pp.933-969, 2003.

C. Galiez and F. Coste, Amplitude spectrum distance: measuring the global shape divergence of protein fragments, BMC Bioinformatics, vol.16, issue.1, p.256, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01214482

R. A. George and J. Heringa, An analysis of protein domain linkers: their classification and role in protein folding, Protein Engineering, vol.15, issue.11, pp.871-879, 2002.

M. Gebser, B. Kaufmann, R. Kaminski, M. Ostrowski, T. Schaub et al., Potassco: The Potsdam answer set solving collection, AI Communications, vol.24, issue.2, pp.107-124, 2011.

C. Galiez, C. Magnan, F. Coste, and P. Baldi, VIRALpro: a new suite for identifying viral capsid and tail sequences, 2015.

G. Gordon, Support vector machines and kernel methods, 2004.

F. Guyon and P. Tufféry, Assessing 3D scores for protein structure fragment mining, Open Access Bioinformatics, vol.2, pp.67-77, 2010.

F. Guyon and P. Tufféry, Fast protein fragment similarity scoring using a Binet-Cauchy kernel, Bioinformatics, 2013.

S. Henikoff and J. G. Henikoff, Amino acid substitution matrices from protein blocks, Proceedings of the National Academy of Sciences, vol.89, issue.22, pp.10915-10919, 1992.

A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis.

E. C. Holmes, The evolution of endogenous viral elements, Cell Host & Microbe, vol.10, issue.4, pp.368-377, 2011.

L. Holm and J. Park, DaliLite workbench for protein structure comparison, Bioinformatics, vol.16, issue.6, pp.566-567, 2000.
DOI : 10.1093/bioinformatics/16.6.566

N. Halabi, O. Rivoire, S. Leibler, and R. Ranganathan, Protein Sectors: Evolutionary Units of Three-Dimensional Structure, Cell, vol.138, issue.4, pp.774-786, 2009.
DOI : 10.1016/j.cell.2009.07.038

B. L. Hurwitz and M. B. Sullivan, The Pacific Ocean Virome (POV): A marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology, PLoS ONE, vol.8, issue.2, p.e57355, 2013.

K. Holmfeldt, N. Solonenko, M. Shah, K. Corrier, L. Riemann et al., Twelve previously unknown phage genera are ubiquitous in global oceans, Proceedings of the National Academy of Sciences, vol.110, pp.12798-12803, 2013.
DOI : 10.1073/pnas.1305956110

A. K. Jain, Fundamentals of Digital Image Processing, 1989.

D. T. Jones, D. W. Buchan, D. Cozzetto, and M. Pontil, PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments, Bioinformatics, vol.28, issue.2, pp.184-190, 2012.
DOI : 10.1093/bioinformatics/btr638

I. Jonassen, I. Eidhammer, D. Conklin, and W. R. Taylor, Structure motif discovery and mining the PDB, Bioinformatics, vol.18, issue.2, pp.362-367, 2002.
DOI : 10.1093/bioinformatics/18.2.362

I. Jonassen, Efficient discovery of conserved patterns using a pattern graph, Computer Applications in the Biosciences (CABIOS), vol.13, issue.5, pp.509-522, 1997.

M. Jusot, Caractérisation en séquence et en structure des protéines virales, 2015.

S. Karlin and S. F. Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proceedings of the National Academy of Sciences, vol.87, issue.6, pp.2264-2268, 1990.
DOI : 10.1073/pnas.87.6.2264

W. Kabsch, A solution for the best rotation to relate two sets of vectors, Acta Crystallographica Section A, vol.32, issue.5, pp.922-923, 1976.

M. Krupovic and D. H. Bamford, Double-stranded DNA viruses: 20 families and only five different architectural principles for virion assembly, Current Opinion in Virology, vol.1, issue.2, pp.118-124, 2011.
DOI : 10.1016/j.coviro.2011.06.001

J. C. Kendrew, G. Bodo, H. M. Dintzis et al., A three-dimensional model of the myoglobin molecule obtained by x-ray analysis, Nature, vol.181, issue.4610, pp.662-666, 1958.

K. Katoh, K. Misawa, K. Kuma, and T. Miyata, MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Research, vol.30, issue.14, pp.3059-3066, 2002.

P. Koehl, Protein structure similarities, Current Opinion in Structural Biology, vol.11, issue.3, pp.348-353, 2001.
DOI : 10.1016/S0959-440X(00)00214-1

S. C. Li, D. Bu, X. Gao, J. Xu, M. Li et al., Designing succinct structural alphabets, Bioinformatics, vol.24, issue.13, pp.i182-i189, 2008.

L. Lo Conte, B. Ailey, T. J. Hubbard, S. E. Brenner, A. G. Murzin, and C. Chothia, SCOP: a structural classification of proteins database, Nucleic Acids Research, vol.28, issue.1, pp.257-259, 2000.

G. Lancia, R. Carr, B. Walenz, and S. Istrail, 101 optimal PDB structure alignments, Proceedings of the fifth annual international conference on Computational biology, RECOMB '01, pp.193-202, 2001.
DOI : 10.1145/369133.369199

M. Levitt, A simplified representation of protein conformations for rapid simulation of protein folding, Journal of Molecular Biology, vol.104, issue.1, pp.59-107, 1976.
DOI : 10.1016/0022-2836(76)90004-8

Y. Lehahn, I. Koren, D. Schatz, M. Frada, U. Sheyn et al., Decoupling physical from biological processes to assess the impact of viruses on a mesoscale algal bloom, Current Biology, vol.24, issue.17, pp.2041-2046, 2014.

Q. Le, G. Pollastri, and P. Koehl, Structural Alphabets for Protein Structure Classification: A Comparison Study, Journal of Molecular Biology, vol.387, issue.2, pp.431-450, 2009.

S. Mac Lane, Categories for the working mathematician, 1978.

C. N. Magnan and P. Baldi, SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity, Bioinformatics, vol.30, issue.18, pp.2592-2597, 2014.
DOI : 10.1093/bioinformatics/btu352

M. Magrane and the UniProt Consortium, UniProt Knowledgebase: a hub of integrated protein data, Database, vol.2011, p.bar009, 2011.
DOI : 10.1093/database/bar009

D. S. Marks, L. J. Colwell, R. Sheridan, T. A. Hopf, A. Pagnani et al., Protein 3D structure computed from evolutionary sequence variation, PLoS ONE, vol.6, issue.12, p.e28766, 2011.

B. Monastyrskyy, D. D'Andrea, K. Fidelis, A. Tramontano, and A. Kryshtafovych, Evaluation of residue-residue contact prediction in CASP10, Proteins: Structure, Function, and Bioinformatics, vol.82, pp.138-153, 2014.

K. Makarova and N. Grishin, Thermolysin and mitochondrial processing peptidase: how far structure-functional convergence goes, Protein Science, vol.8, issue.11, pp.2537-2540, 1999.

D. S. Marks, T. A. Hopf, and C. Sander, Protein structure prediction from sequence variation, Nature Biotechnology, vol.30, issue.11, pp.1072-1080, 2012.

A. Marin, J. Pothier, K. Zimmermann, and J. Gibrat, FROST: A filter-based fold recognition method, Proteins: Structure, Function, and Genetics, vol.49, issue.4, pp.493-509, 2002.
DOI : 10.1002/prot.10231

S. Minami, K. Sawada, and G. Chikenji, MICAN: a protein structure alignment algorithm that can handle Multiple-chains, Inverse alignments, Cα only models, Alternative alignments, and Non-sequential alignments, BMC Bioinformatics, vol.14, issue.1, p.24, 2013.
DOI : 10.1186/1471-2105-14-24

C. Notredame, D. G. Higgins, and J. Heringa, T-Coffee: a novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology, vol.302, issue.1, pp.205-217, 2000.
DOI : 10.1006/jmbi.2000.4042

B. North, A. Lehmann, and R. L. Dunbrack Jr., A New Clustering of Antibody CDR Loop Conformations, Journal of Molecular Biology, vol.406, issue.2, pp.228-256, 2011.
DOI : 10.1016/j.jmb.2010.10.030

S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, vol.48, issue.3, pp.443-453, 1970.

A. R. Panchenko, A. Marchler-Bauer, and S. H. Bryant, Combination of threading potentials and sequence profiles improves fold recognition, Journal of Molecular Biology, vol.296, issue.5, pp.1319-1331, 2000.

S. Pesant, F. Not, M. Picheral, S. Kandels-Lewis, N. Le Bescot, G. Gorsky, R. Troublé et al., Open science resources for the discovery and analysis of Tara Oceans data, Scientific Data, 2015.

M. Punta and Y. Ofran, The Rough Guide to In Silico Function Prediction, or How To Use Sequence and Structure Information To Predict Protein Function, PLoS Computational Biology, vol.4, issue.10, p.e1000160, 2008.
DOI : 10.1371/journal.pcbi.1000160

M. Punta and B. Rost, PROFcon: novel prediction of long-range contacts, Bioinformatics, vol.21, issue.13, pp.2960-2968, 2005.
DOI : 10.1093/bioinformatics/bti454

G. Pugalenthi, P. N. Suganthan, R. Sowdhamini, and S. Chakrabarti, MegaMotifBase: a database of structural motifs in protein families and superfamilies, Nucleic Acids Research, vol.36, issue.Database, pp.218-221, 2008.
DOI : 10.1093/nar/gkm794

K. D. Pruitt, T. Tatusova, G. R. Brown, and D. R. Maglott, NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy, Nucleic Acids Research, vol.40, issue.D1, pp.D130-D135, 2012.
DOI : 10.1093/nar/gkr1079

P. Røgen and B. Fain, Automatic classification of protein structure by using Gauss integrals, Proceedings of the National Academy of Sciences, vol.100, issue.1, pp.119-124, 2003.

J. Riguet, Relations binaires, fermetures, correspondances de Galois, Bulletin de la Société Mathématique de France, pp.114-155, 1948.

K. R. Rao, D. N. Kim, and J. J. Hwang, Fast Fourier Transform - Algorithms and Applications, 2011.

B. Rost, Protein structures sustain evolutionary drift, Folding and Design, pp.19-24, 1997.

L. Royer, M. Reimann, B. Andreopoulos, and M. Schroeder, Unraveling Protein Networks with Power Graph Analysis, PLoS Computational Biology, vol.4, issue.7, p.e1000108, 2008.
DOI : 10.1371/journal.pcbi.1000108

J. Shapiro and D. Brutlag, FoldMiner: Structural motif discovery using an improved superposition algorithm, Protein Science, vol.13, issue.1, pp.278-294, 2004.
DOI : 10.1110/ps.03239404

C. J. A. Sigrist, E. de Castro, L. Cerutti, B. A. Cuche, N. Hulo et al., New and continuing developments at PROSITE, Nucleic Acids Research, vol.41, pp.344-347, 2013.

K. T. Simons, C. Kooperberg, E. Huang, and D. Baker, Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions, Journal of Molecular Biology, vol.268, issue.1, pp.209-225, 1997.

I. Sillitoe, T. E. Lewis, A. Cuff, S. Das, P. Ashford, N. L. Dawson et al., CATH: comprehensive structural and functional annotations for genome sequences, Nucleic Acids Research, vol.43, issue.D1, pp.D376-D381, 2015.

C. A. Suttle, Marine viruses - major players in the global ecosystem, Nature Reviews Microbiology, vol.5, issue.10, pp.801-812, 2007.
DOI : 10.1038/nrmicro1750

T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology, vol.147, issue.1, pp.195-197, 1981.

R. D. Sleator and P. Walsh, An overview of in silico protein function prediction, Archives of Microbiology, vol.192, issue.3, pp.151-155, 2010.

M. Tyagi, V. S. Gowri, N. Srinivasan, A. G. de Brevern, and B. Offmann, A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications, Proteins: Structure, Function, and Bioinformatics, vol.65, issue.1, pp.32-39, 2006.
DOI : 10.1002/prot.21087
URL : https://hal.archives-ouvertes.fr/inserm-00133760

J. L. Thorne, H. Kishino, and J. Felsenstein, An evolutionary model for maximum likelihood alignment of DNA sequences, Journal of Molecular Evolution, vol.33, issue.2, pp.114-124, 1991.

I. Wohlers, R. Andonov, and G. W. Klau, DALIX: Optimal DALI Protein Structure Alignment, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, issue.1, 2012.
DOI : 10.1109/TCBB.2012.143

K. E. Wommack, J. Bhavsar, S. W. Polson, J. Chen, M. Dumas et al., VIROME: a standard operating procedure for analysis of viral metagenome sequences, Standards in Genomic Sciences, vol.6, issue.3, pp.427-439, 2012.

S. J. Wheelan, A. Marchler-Bauer, and S. H. Bryant, Domain size distributions can predict domain boundaries, Bioinformatics, vol.16, issue.7, pp.613-618, 2000.
DOI : 10.1093/bioinformatics/16.7.613

X. Cui, S. C. Li, L. He, and M. Li, Fingerprinting protein structures effectively and efficiently, Bioinformatics, 2013.

Y. Zhang and J. Skolnick, Scoring function for automated assessment of protein structure template quality, Proteins: Structure, Function, and Bioinformatics, vol.57, issue.4, pp.702-710, 2004.

Y. Zhang and J. Skolnick, TM-align: a protein structure alignment algorithm based on the TM-score, Nucleic Acids Research, vol.33, issue.7, pp.2302-2309, 2005.
DOI : 10.1093/nar/gki524