MC11] with a query searching for capsids and nucleocapsids (a type of capsid in which the capsid is directly bound to the genetic material). We then applied a taxonomic filter to keep only viruses (and to remove, for example, sequences coding for proteins that bind to capsids), and only the UniProt identifiers associated with PDB structures were kept. After reducing the redundancy of the PDB sequences (90%), we obtained a dataset of 327 capsid chains ,
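The selection steps described above (taxonomic filter, PDB cross-reference filter, 90% redundancy reduction) can be sketched as follows. This is only an illustrative sketch: the `Entry` record, the `select_capsids` function and the `identity` helper are hypothetical names, the text does not say which tool performed the redundancy reduction, and `difflib`'s similarity ratio is used here as a crude stand-in for an alignment-based sequence identity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Entry:
    """Hypothetical record for one sequence returned by the capsid query."""
    uniprot_id: str
    lineage: tuple   # taxonomic lineage, e.g. ("Viruses", "Caudovirales")
    pdb_ids: tuple   # cross-referenced PDB structures (may be empty)
    sequence: str


def identity(a: str, b: str) -> float:
    # Crude similarity ratio in [0, 1]; an assumption standing in for a
    # real alignment-based sequence identity.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def select_capsids(entries, threshold=0.90):
    # Step 1: keep only viral entries (taxonomic filter).
    # Step 2: keep only entries linked to at least one PDB structure.
    viral = [e for e in entries if "Viruses" in e.lineage and e.pdb_ids]
    # Step 3: greedy redundancy reduction, longest sequences first; an entry
    # is kept only if it is below the threshold against every kept entry.
    kept = []
    for e in sorted(viral, key=lambda x: len(x.sequence), reverse=True):
        if all(identity(e.sequence, k.sequence) < threshold for k in kept):
            kept.append(e)
    return kept
```

With real data, step 3 would normally be delegated to a dedicated clustering tool; the greedy loop above only shows the principle of keeping one representative per group of near-identical sequences.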
Marine Phage, Virus, and Virome Sequencing Project ,
enabled the publication of marine phage genomes in the iMicrobe database http://www.imicrobe.us. The code associated with this project in iMicrobe is CAM_PROJ_BroadPhageGenomes. It contains 20343 protein sequences, of which 1172 have no annotation ,
Protein sequence databases, Current Opinion in Chemical Biology, vol.8, issue.1, pp.76-80, 2004. ,
Basic local alignment search tool, Journal of molecular biology, vol.215, issue.3, pp.403-410, 1990. ,
SCOP2 prototype: a new approach to protein structure mining, Nucleic Acids Research, vol.42, issue.D1, pp.310-314, 2014. ,
DOI : 10.1093/nar/gkt1242
Maximum Contact Map Overlap Revisited, Journal of Computational Biology, vol.18, issue.1, pp.27-41, 2011. ,
DOI : 10.1089/cmb.2009.0196
URL : https://hal.archives-ouvertes.fr/inria-00536624
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic acids research, vol.25, issue.17, pp.3389-3402, 1997. ,
Studies on the principles that govern the folding of protein chains, 1972. ,
The Pfam protein families database, Nucleic Acids Research, vol.30, issue.1, pp.276-280, 2002. ,
URL : https://hal.archives-ouvertes.fr/hal-01294685
Joël Pothier, and Alaguraj Veluchamy. Identifying distant homologous viral sequences in metagenomes using protein structure information, ECCB'14 Workshop on Recent Computational Advances in Metagenomics, 2014. ,
The moderately efficient enzyme: Evolutionary and physicochemical trends shaping enzyme parameters, Biochemistry, vol.50, issue.21, pp.4402-4410, 2011. ,
The Protein Data Bank: a computer-based archival file for macromolecular structures, Journal of molecular biology, vol.112, issue.3, pp.535-542, 1977. ,
Conserved spatially interacting motifs of protein superfamilies: Application to fold recognition and function annotation of genome data, Proteins: Structure, Function, and Bioinformatics, vol.30, issue.Suppl, pp.657-670, 2004. ,
DOI : 10.1002/prot.10638
FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately, Proceedings of the National Academy of Sciences, pp.3481-3486, 2010. ,
DOI : 10.1073/pnas.0914097107
YAKUSA: A fast structural database scanning method, Proteins: Structure, Function, and Bioinformatics, vol.49, issue.Suppl 6, pp.137-151, 2005. ,
DOI : 10.1002/prot.20517
BLAST+: architecture and applications, BMC Bioinformatics, vol.10, issue.1, p.421, 2009. ,
DOI : 10.1186/1471-2105-10-421
New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage, Proteins: Structure, Function, and Bioinformatics, vol.68, issue.4, pp.915-921, 2007. ,
A Hidden Markov Model Derived Structural Alphabet for Proteins, Journal of Molecular Biology, vol.339, issue.3, pp.591-605, 2004. ,
DOI : 10.1016/j.jmb.2004.04.005
Metagenomic Analysis of Coastal RNA Virus Communities, Science, vol.312, issue.5781, pp.1795-1798, 2006. ,
DOI : 10.1126/science.1127404
M-tree: An efficient access method for similarity search in metric spaces, Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB '97, pp.426-435, 1997. ,
A structural explanation for the twilight zone of protein sequence homology, Structure, vol.4, issue.10, pp.1123-1127, 1996. ,
An introduction to support vector machines and other kernel-based learning methods, 2000. ,
DOI : 10.1017/CBO9780511801389
An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, vol.19, issue.90, pp.297-301, 1965. ,
DOI : 10.1090/S0025-5718-1965-0178586-1
A cluster separation measure ,
Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks, Proteins: Structure, Function, and Genetics, vol.41, issue.3, pp.271-287, 2000. ,
DOI : 10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z
URL : https://hal.archives-ouvertes.fr/inserm-00132821
Protein Fragments: Functional and Structural Roles of Their Coevolution Networks, PLoS ONE, vol.52, issue.3, 2012. ,
DOI : 10.1371/journal.pone.0048124.s019
SAbDab: the structural antibody database, Nucleic Acids Research, vol.42, issue.D1, pp.1140-1146, 2014. ,
DOI : 10.1093/nar/gkt1043
Deep architectures for protein contact map prediction, Bioinformatics, vol.28, pp.2449-2457, 2012. ,
The unfoldomics decade: an update on intrinsically disordered proteins, BMC Genomics, vol.9, issue.Suppl 2, p.S1, 2008. ,
The distribution of rates of spontaneous mutation over viruses, prokaryotes, and eukaryotes, Annals of the New York Academy of Sciences, vol.870, issue.1, pp.100-107, 1999. ,
Ioannis Filippis, and Michael Lappe. Optimal contact definition for reconstruction of contact maps, BMC bioinformatics, vol.11, issue.1, p.283, 2010. ,
A structural alphabet for local protein structures: Improved prediction methods, Proteins: Structure, Function, and Bioinformatics, vol.20, issue.4, pp.810-827, 2005. ,
DOI : 10.1002/prot.20458
URL : https://hal.archives-ouvertes.fr/inserm-00143564
Profile hidden Markov models, Bioinformatics, vol.14, issue.9, pp.755-763, 1998. ,
Muscle, Nucleic acids research, vol.32, issue.5, pp.1792-1797, 2004. ,
DOI : 10.1007/978-1-349-13443-4_4
URL : https://hal.archives-ouvertes.fr/hal-00897814
Settling the intractability of multiple alignment, Lecture Notes in Computer Science, vol.2906, pp.352-363, 2003. ,
Viral metagenomics, Nature Reviews Microbiology, vol.3, issue.6, pp.504-510, 2005. ,
An efficient boosting algorithm for combining preferences, J. Mach. Learn. Res, vol.4, pp.933-969, 2003. ,
Amplitude spectrum distance: measuring the global shape divergence of protein fragments, BMC Bioinformatics, vol.1, issue.2, p.256, 2015. ,
DOI : 10.1109/TPAMI.1979.4766909
URL : https://hal.archives-ouvertes.fr/hal-01214482
An analysis of protein domain linkers: their classification and role in protein folding, Protein Engineering, vol.15, issue.11, pp.871-879, 2002. ,
Potassco: The Potsdam answer set solving collection, pp.107-124, 2011. ,
Viralpro: a new suite for identifying viral capsid and tail sequences, 2015. ,
Support vector machines and kernel methods, 2004. ,
Assessing 3D scores for protein structure fragment mining, Open Access Bioinformatics, vol.2, pp.67-77, 2010. ,
Fast protein fragment similarity scoring using a Binet-Cauchy kernel, Bioinformatics, 2013. ,
Amino acid substitution matrices from protein blocks, Proceedings of the National Academy of Sciences, pp.10915-10919, 1992. ,
Juha Karhunen, and Erkki Oja. Independent component analysis ,
The evolution of endogenous viral elements, Cell host & microbe, vol.10, issue.4, pp.368-377, 2011. ,
DaliLite workbench for protein structure comparison, Bioinformatics, vol.16, issue.6, pp.566-567, 2000. ,
DOI : 10.1093/bioinformatics/16.6.566
Protein Sectors: Evolutionary Units of Three-Dimensional Structure, Cell, vol.138, issue.4, pp.774-786, 2009. ,
DOI : 10.1016/j.cell.2009.07.038
The Pacific Ocean Virome (POV): A marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology, PLoS ONE, vol.8, issue.2, p.57355, 2013. ,
Twelve previously unknown phage genera are ubiquitous in global oceans, Proceedings of the National Academy of Sciences, vol.110, pp.12798-12803, 2013. ,
DOI : 10.1073/pnas.1305956110
Fundamentals of Digital Image Processing, 1989. ,
PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments, Bioinformatics, vol.28, issue.2, pp.184-190, 2012. ,
DOI : 10.1093/bioinformatics/btr638
Structure motif discovery and mining the PDB, Bioinformatics, vol.18, issue.2, pp.362-367, 2002. ,
DOI : 10.1093/bioinformatics/18.2.362
Efficient discovery of conserved patterns using a pattern graph, Computer Applications in the Biosciences (CABIOS), vol.13, issue.5, pp.509-522, 1997. ,
Caractérisation en séquence et en structure des protéines virales, 2015. ,
Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes., Proceedings of the National Academy of Sciences, vol.87, issue.6, pp.2264-2268, 1990. ,
DOI : 10.1073/pnas.87.6.2264
A solution for the best rotation to relate two sets of vectors ,
Double-stranded DNA viruses: 20 families and only five different architectural principles for virion assembly, Virus structure and function, pp.118-124, 2011. ,
DOI : 10.1016/j.coviro.2011.06.001
A three-dimensional model of the myoglobin molecule obtained by x-ray analysis, Nature, vol.181, issue.4610, pp.662-666, 1958. ,
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic acids research, vol.30, issue.14, pp.3059-3066, 2002. ,
Protein structure similarities, Current Opinion in Structural Biology, vol.11, issue.3, pp.348-353, 2001. ,
DOI : 10.1016/S0959-440X(00)00214-1
Designing succinct structural alphabets, Bioinformatics, vol.24, issue.13, pp.i182-i189, 2008. ,
SCOP: a structural classification of proteins database, Nucleic Acids Research, vol.28, issue.1, pp.257-259, 2000. ,
101 optimal PDB structure alignments, Proceedings of the fifth annual international conference on Computational biology, RECOMB '01, pp.193-202, 2001. ,
DOI : 10.1145/369133.369199
A simplified representation of protein conformations for rapid simulation of protein folding, Journal of Molecular Biology, vol.104, issue.1, pp.59-107, 1976. ,
DOI : 10.1016/0022-2836(76)90004-8
Decoupling physical from biological processes to assess the impact of viruses on a mesoscale algal bloom, Current Biology, vol.24, issue.17, pp.2041-2046, 2014. ,
Structural Alphabets for Protein Structure Classification: A Comparison Study, pp.431-450 ,
Saunders MacLane. Categories for the working mathematician, 1978. ,
SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity, Bioinformatics, vol.30, issue.18, pp.2592-2597, 2014. ,
DOI : 10.1093/bioinformatics/btu352
UniProt Knowledgebase: a hub of integrated protein data, Database, vol.2011, issue.0, 2011. ,
DOI : 10.1093/database/bar009
Protein 3D structure computed from evolutionary sequence variation, PloS one, vol.6, issue.12, p.28766, 2011. ,
Evaluation of residue-residue contact prediction in CASP10, Proteins: Structure, Function, and Bioinformatics, vol.82, pp.138-153, 2014. ,
Thermolysin and mitochondrial processing peptidase: how far structure-functional convergence goes, Protein Science, vol.8, issue.11, pp.2537-2540, 1999. ,
Protein structure prediction from sequence variation, Nature biotechnology, vol.30, issue.11, pp.1072-1080, 2012. ,
FROST: A filter-based fold recognition method, Proteins: Structure, Function, and Genetics, vol.34, issue.4, pp.493-509, 2002. ,
DOI : 10.1002/prot.10231
MICAN: a protein structure alignment algorithm that can handle Multiple-chains, Inverse alignments, Cα only models, Alternative alignments, and Non-sequential alignments, BMC Bioinformatics, vol.14, issue.1, p.24, 2013. ,
DOI : 10.1016/j.jmb.2005.12.084
T-coffee: a novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology, vol.302, issue.1, pp.205-217, 2000. ,
DOI : 10.1006/jmbi.2000.4042
A New Clustering of Antibody CDR Loop Conformations, Journal of Molecular Biology, vol.406, issue.2, pp.228-256, 2011. ,
DOI : 10.1016/j.jmb.2010.10.030
A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, vol.48, issue.3, pp.443-453, 1970. ,
Combination of threading potentials and sequence profiles improves fold recognition, Journal of Molecular Biology, vol.296, issue.5, pp.1319-1331, 2000. ,
Romain Troublé, et al. Open science resources for the discovery and analysis of Tara Oceans data, Scientific Data, 2015. ,
The Rough Guide to In Silico Function Prediction, or How To Use Sequence and Structure Information To Predict Protein Function, PLoS Computational Biology, vol.11, issue.10, p.1000160, 2008. ,
DOI : 10.1371/journal.pcbi.1000160.s001
PROFcon: novel prediction of long-range contacts, Bioinformatics, vol.21, issue.13, pp.2960-2968, 2005. ,
DOI : 10.1093/bioinformatics/bti454
MegaMotifBase: a database of structural motifs in protein families and superfamilies, Nucleic Acids Research, vol.36, issue.Database, pp.218-221, 2008. ,
DOI : 10.1093/nar/gkm794
NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy, Nucleic Acids Research, vol.40, issue.D1, pp.40-130, 2012. ,
DOI : 10.1093/nar/gkr1079
Automatic classification of protein structure by using Gauss integrals, Proceedings of the National Academy of Sciences, vol.100, issue.1, pp.119-124, 2003. ,
Relations binaires, fermetures, correspondances de Galois, Bulletin de la Société mathématique de France, pp.114-155, 1948. ,
Fast Fourier Transform-Algorithms and Applications, 2011. ,
Protein structures sustain evolutionary drift, Folding and Design, pp.19-24, 1997. ,
Unraveling Protein Networks with Power Graph Analysis, PLoS Computational Biology, vol.34, issue.7, 2008. ,
DOI : 10.1371/journal.pcbi.1000108.t004
FoldMiner: Structural motif discovery using an improved superposition algorithm, Protein Science, vol.13, issue.1, pp.278-294, 2004. ,
DOI : 10.1110/ps.03239404
New and continuing developments at PROSITE, Nucleic Acids Research, vol.41, pp.344-347, 2013. ,
Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions, Journal of Molecular Biology, vol.268, issue.1, pp.209-225, 1997. ,
CATH: comprehensive structural and functional annotations for genome sequences, Nucleic Acids Research, vol.43, issue.D1, p.376, 2015. ,
Marine viruses - major players in the global ecosystem, Nature Reviews Microbiology, vol.46, issue.10, 2007. ,
DOI : 10.1038/nrmicro1750
Identification of common molecular subsequences, Journal of molecular biology, vol.147, issue.1, pp.195-197, 1981. ,
An overview of in silico protein function prediction, Archives of Microbiology, vol.192, issue.3, pp.151-155, 2010. ,
A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications, Proteins: Structure, Function, and Bioinformatics, vol.272, issue.1, pp.32-39, 2006. ,
DOI : 10.1002/prot.21087
URL : https://hal.archives-ouvertes.fr/inserm-00133760
An evolutionary model for maximum likelihood alignment of DNA sequences, Journal of Molecular Evolution, vol.33, issue.2, pp.114-124, 1991. ,
DALIX: Optimal DALI Protein Structure Alignment, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, issue.1, 2012. ,
DOI : 10.1109/TCBB.2012.143
VIROME: a standard operating procedure for analysis of viral metagenome sequences, Standards in Genomic Sciences, vol.6, issue.3, pp.427-439, 2012. ,
Domain size distributions can predict domain boundaries, Bioinformatics, vol.16, issue.7, pp.613-618, 2000. ,
DOI : 10.1093/bioinformatics/16.7.613
Fingerprinting protein structures effectively and efficiently, Bioinformatics, 2013. ,
Scoring function for automated assessment of protein structure template quality, Proteins: Structure, Function, and Bioinformatics, vol.57, issue.4, pp.702-710, 2004. ,
TM-align: a protein structure alignment algorithm based on the TM-score, Nucleic Acids Research, vol.33, issue.7, pp.2302-2309, 2005. ,
DOI : 10.1093/nar/gki524