Group: Core::CSB / Package: Protein representation

Package: Protein representation / Class: Polypeptide chain representation. Gives access to a number of high-level accessors and iterators to manipulate a polypeptide chain. Allows access to the three structures described in the previous sub-section through a single class.

Package: Protein representation / Class: Protein representation. Gives access to a number of specified polypeptide chains from a protein quaternary structure.

The SBL provides many applications which rely on different structures tied to a polypeptide chain:
- topological information, i.e. the covalent bonds (Group: Core::CSB / Package: Molecular covalent structure)

Package: Molecular distances / Class: SBL::CSB::RMSD comb for motifs. We provide a new class for the Molecular distances package. Given a set of structural motifs, this class builds the motif graph.

Package: Molecular distances / Class: SBL::Modules::RMSD comb for motifs module. We provide the module enabling the use of the previous class in a workflow.

Group: SBL::Applications / Package: Molecular distances flexible. We provide an application which, given a set of polypeptide chains as well as "subdomain" definitions (labeled residue ranges), computes the RMSD Comb. The specification of labels is provided by SBL::Models::MolecularSystemLabelTraits. Example specification files can be found in the documentation. The application provides three executables:
- sbl-flexible-rmsd-proteins
- sbl-flexible-rmsd-conformations.exe is used to compare conformations of an identical protein
- sbl-flexible-rmsd-motifs.exe is used to compute the RMSD Comb. of two chains with user-specified structural motifs

Group: SBL::Applications / Package: Structural motifs. Pre-requisites: following the contributions from ADDREF, we provide a novel package in the SBL. Given two polypeptide chains, the goal of this package is to identify structural motifs using any of the four methods from ADDREF.

Bibliography

R. Aldahdooh and W. Ashour, DSMK means density-based split-and-merge k-means clustering algorithm, Journal of Artificial Intelligence and Soft Computing Research, vol.3, issue.1, pp.51-71, 2013.

N. Akkiraju and H. Edelsbrunner, Triangulating the surface of a molecule, Discrete Appl. Math, vol.71, pp.5-22, 1996.

K. S. Arun, T. S. Huang, and S. D. Blostein, Least-square fitting of two 3D point sets, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.9, issue.5, pp.698-700, 1987.

R. Andonov, N. Malod-dognin, and N. Yanev, Maximum Contact Map Overlap Revisited, J. of Computational Biology, vol.18, issue.1, pp.1-15, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00536624

S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, NAR, vol.25, issue.17, pp.3389-3402, 1997.

F. Aurenhammer, Power diagrams: properties, algorithms and applications, SIAM J. Comput, vol.16, pp.78-96, 1987.

D. Arthur and S. Vassilvitskii, k-means++: the advantages of careful seeding, ACM-SIAM SODA, page 1035, 2007.

B. Bursteinas, R. Britto, B. Bely et al., Minimizing proteome redundancy in the UniProt Knowledgebase, Database, 2016.

J. Baldwin and C. Chothia, Haemoglobin: the structural changes related to ligand binding and its allosteric mechanism, JMB, vol.129, issue.2, pp.175-220, 1979.

C. Barrett, R. Hughey, and K. Karplus, Scoring hidden markov models, Computer applications in the biosciences, vol.13, pp.191-199, 1997.

H. Berman, K. Henrick, and H. Nakamura, Announcing the worldwide Protein Data Bank, Nature Structural Biology, vol.10, 2003.

J. Barthélemy and B. Leclerc, The median procedure for partitions. Partitioning data sets, vol.19, pp.3-34, 1993.

R. Blankenbecler, M. Ohlsson, C. Peterson, and M. Ringnér, Matching protein structures with fuzzy alignments, PNAS, vol.100, issue.21, 2003.

J.-P. Baudry et al., Combining mixture components for clustering, Journal of computational and graphical statistics, pp.332-353, 2010.

M. Betancourt and J. Skolnick, Universal similarity measure for comparing protein structures, Biopolymers, vol.59, issue.5, pp.305-309, 2001.

S. Bressanelli, K. Stiasny, S. Allison, E. Stura, S. Duquerroy et al., Structure of a flavivirus envelope glycoprotein in its low-ph-induced membrane fusion conformation, The EMBO journal, vol.23, issue.4, pp.728-738, 2004.

C. Branden and J. Tooze, Introduction to protein structure, 1998.

J. Boissonnat and M. Yvinec, Algorithmic geometry, 1998.

G. Csaba, F. Birzele, and R. Zimmer, Protein structure alignment considering phenotypic plasticity, Bioinformatics, vol.24, issue.16, pp.98-104, 2008.

F. Cazals and T. Dreyfus, The Structural Bioinformatics Library: modeling in biomolecular science and beyond, Bioinformatics, vol.33, issue.7, pp.1-8, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01379635

F. Chazal, L. Guibas, S. Oudot, and P. Skraba, Persistence-based clustering in riemannian manifolds, J. ACM, vol.60, issue.6, pp.1-38, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01094872

J. Chen, M. Guo, X. Wang, and B. Liu, A comprehensive review and comparison of different computational methods for protein remote homology detection, Briefings in bioinformatics, vol.19, issue.2, pp.231-244, 2016.

Y. Cheng, Mean shift, mode seeking, and clustering, IEEE PAMI, vol.17, issue.8, pp.790-799, 1995.

P. Crescenzi and V. Kann, How to find the best approximation results-a follow-up to garey and johnson, ACM SIGACT News, vol.29, issue.4, pp.90-97, 1998.

F. Cazals, H. Kanhere, and S. Loriot, Computing the volume of union of balls: a certified algorithm, ACM Transactions on Mathematical Software, vol.38, issue.1, pp.1-20, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00849809

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, 2009.

F. Cazals, D. Mazauric, R. Tetley, and R. Watrigant, Comparing two clusterings using matchings between clusters of clusters, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01514872

F. Chataigner, G. Manic, Y. Wakabayashi, and R. Yuster, Approximation algorithms and hardness results for the clique packing problem, Disc. Appl. Math, vol.157, issue.7, pp.1396-1406, 2009.

D. Cohen-steiner, H. Edelsbrunner, and J. Harer, Stability of persistence diagrams, Discrete & Computational Geometry, vol.37, issue.1, pp.103-120, 2007.

T. M. Cover and J. A. Thomas, Elements of Information Theory, 2006.

F. Cazals and R. Tetley, Characterizing molecular flexibility by combining lRMSD measures, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01968175

F. Cazals and R. Tetley, Multiscale analysis of structurally conserved motifs, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01968176

S. Dey, P. Chakrabarti, and J. Janin, A survey of hemoglobin quaternary structures, Proteins: Structure, Function, and Bioinformatics, vol.79, issue.10, pp.2861-2870, 2011.

K. Dabrowski, M. Demange, and V. V. Lozin, New results on maximum induced matchings in bipartite graphs and beyond, Theoretical Computer Science, vol.478, pp.33-40, 2013.

P. Duchon, P. Flajolet, G. Louchard, and G. Schaeffer, Boltzmann samplers for the random generation of combinatorial structures, Combinatorics, Probability and Computing, vol.13, issue.4-5, pp.577-625, 2004.
URL : https://hal.archives-ouvertes.fr/hal-00307530

R. O. Duda and P. E. Hart, Pattern classification and scene analysis, 1973.

S. Dongen, Performance criteria for graph clustering and markov cluster experiments, 2000.

S. R. Eddy, Profile hidden markov models, Bioinformatics, vol.14, issue.9, pp.755-763, 1998.

S. R. Eddy, A probabilistic model of local sequence alignment that simplifies statistical significance estimation, PLoS Comput Biol, vol.4, issue.5, p.1000069, 2008.

S. Eddy, HMMER user's guide. biological sequence analysis using profile hidden markov models, 2015.

H. Edelsbrunner, Weighted alpha shapes, Dept. Comput. Sci., Univ. Illinois, 1992.

H. Edelsbrunner, The union of balls and its dual shape, Discrete Comput. Geom, vol.13, pp.415-440, 1995.

R. C. Edgar, Muscle: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research, vol.32, issue.5, pp.1792-1797, 2004.
DOI : 10.1093/nar/gkh340
URL : http://europepmc.org/articles/pmc390337?pdf=render

H. Edelsbrunner and J. Harer, Computational topology: an introduction, 2010.

W. Eaton, E. Henry, J. Hofrichter, and A. Mozzarelli, Is cooperative oxygen binding by hemoglobin really understood? Rendiconti Lincei, vol.17, pp.147-162, 2006.
DOI : 10.1007/bf02904506

R. Finn, J. Clements, and S. R. Eddy, HMMER web server: interactive sequence similarity searching, NAR, p.367, 2011.
DOI : 10.1093/nar/gkr367
URL : https://academic.oup.com/nar/article-pdf/39/suppl_2/W29/7628106/gkr367.pdf

S. Federhen, The ncbi taxonomy database, Nucleic acids research, vol.40, issue.D1, pp.136-143, 2012.

A. Fersht, Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding, 1999.

J. Fedry, J. Forcina, P. Legrand, G. Pehau-arnaudet, A. Haouz et al., Evolutionary diversification of the HAP2 membrane insertion motifs to drive gamete fusion across eukaryotes, PLoS Biology, 2018.

A. Fred and A. K. Jain, Data clustering using evidence accumulation, Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol.4, pp.276-280, 2002.
DOI : 10.1109/icpr.2002.1047450
URL : http://www.cse.msu.edu/prip/Files/AFred_AJain_ICPR2002.pdf

J. Fédry, Y. Liu, G. Péhau-arnaudet, J. Pei, W. Li et al., The ancient gamete fusogen hap2 is a eukaryotic class ii fusion protein, Cell, vol.168, issue.5, pp.904-915, 2017.

P. Flajolet and R. Sedgewick, Analytic combinatorics, 2009.
DOI : 10.1017/cbo9780511801655
URL : https://hal.archives-ouvertes.fr/inria-00072739

M. Fredman and R. Tarjan, Fibonacci heaps and their uses in improved network optimization algorithms, J. ACM, vol.34, issue.3, pp.596-615, 1987.

V. Garcia, A generative cell specific 1 ortholog in drosophila melanogaster, 2012.

P. Guardado-calvo and F. A. Rey, The envelope proteins of the bunyavirales, Advances in Virus Research, vol.98, pp.83-118, 2017.

O. Goldschmidt and D. S. Hochbaum, A polynomial algorithm for the k-cut problem for fixed k, Mathematics of operations research, vol.19, 1994.

D. Goldman, S. Istrail, and C. Papadimitriou, Algorithmic aspects of protein structure similarity, Foundations of Computer Science, 1999. 40th Annual Symposium on, pp.512-521, 1999.

R. Graham, D. Knuth, and O. Patashnik, Concrete mathematics: a foundation for computer science, 1989.

M. Gerstein and M. Levitt, Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins, Protein Science, vol.7, issue.2, pp.445-456, 1998.

J. E. Goodman, J. O'Rourke, and C. D. Tóth, Handbook of Discrete and Computational Geometry, 2017.

M. Gerstein and F. M. Richards, Protein geometry: volumes, areas, and distances, The international tables for crystallography, pp.531-539, 2001.

A. Godzik and J. Skolnick, Flexible algorithm for direct multiple alignment of protein structures and sequences, Bioinformatics, vol.10, issue.6, p.587, 1994.

F. Guyon and P. Tufféry, Fast protein fragment similarity scoring using a Binet-Cauchy kernel, Bioinformatics, vol.30, issue.6, pp.784-791, 2014.

S. C. Harrison, Viral membrane fusion, Virology, pp.498-507, 2015.

H. Hasegawa and L. Holm, Advances and pitfalls of protein structural alignment, Current opinion in structural biology, vol.19, issue.3, pp.341-348, 2009.

L. Holm and C. Sander, Protein structure comparison by alignment of distance matrices, Journal of molecular biology, vol.233, issue.1, pp.123-138, 1993.

L. Holm and C. Sander, Dali: a network tool for protein structure comparison, Trends in biochemical sciences, vol.20, issue.11, pp.478-480, 1995.

A. K. Jain, Data clustering: 50 years beyond k-means, Pattern recognition letters, vol.31, issue.8, pp.651-666, 2010.

J. Park, K. Karplus, C. Barrett, R. Hughey et al., Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods, JMB, vol.284, issue.4, pp.1201-1210, 1998.

W. Kabsch, A solution for the best rotation to relate two sets of vectors, Acta Crystallographica Section A, vol.32, issue.5, pp.922-923, 1976.

V. Kann, Maximum bounded 3-dimensional matching is MAX SNP-complete, Inf. Process. Lett, vol.37, issue.1, pp.27-35, 1991.

R. M. Karp, Reducibility among combinatorial problems, Complexity of Computer Computations, pp.85-103, 1972.

K. Karplus, C. Barrett, and R. Hughey, Hidden Markov models for detecting remote protein homologies, Bioinformatics, vol.14, issue.10, 1998.

A. Krogh, M. Brown et al., Hidden Markov Models in computational biology: Applications to protein modeling, JMB, pp.1501-1531, 1994.

K. Kedem, P. Chew, and R. Elber, Unit-vector rms (urms) as a tool to analyze molecular dynamics trajectories, Proteins: Structure, Function, and Bioinformatics, vol.37, issue.4, pp.554-564, 1999.

M. Kielian, Mechanisms of virus membrane fusion proteins, Ann. Rev. Virol, vol.1, pp.171-89, 2014.

L. Käll, A. Krogh, and E. Sonnhammer, A combined transmembrane topology and signal peptide prediction method, Journal of molecular biology, vol.338, issue.5, pp.1027-1036, 2004.

K. Karplus, R. Karchin, G. Shackelford, and R. Hughey, Calibrating e-values for hidden markov models using reverse-sequence null models, Bioinformatics, vol.21, issue.22, pp.4107-4115, 2005.

R. Kolodny and N. Linial, Approximate protein structural alignment in polynomial time, PNAS, vol.101, pp.12201-12206, 2004.

T. Kodinariya and P. Prashant, Review on determining number of cluster in k-means clustering, International Journal, vol.1, issue.6, pp.90-95, 2013.

M. Kielian and F. Rey, Virus membrane-fusion proteins: more than one way to make a hairpin, Nature Reviews Microbiology, vol.4, issue.1, pp.67-76, 2006.

J. Kleinberg and É. Tardos, Algorithm design, Pearson Education India, 2006.

R. Kannan, S. Vempala, and A. Vetta, On clusterings: Good, bad and spectral, Journal of the ACM (JACM), vol.51, issue.3, pp.497-515, 2004.

B. Larsen and C. Aone, Fast and effective text mining using linear-time document clustering, ACM SIGKDD, pp.16-22, 1999.

C. Leslin, A. Abyzov, and V. Ilyin, TOPOFIT-DB, a database of protein structural alignments based on the topofit method, Nucleic acids research, vol.35, issue.1, pp.317-321, 2006.

P. Liu, D. Agrafiotis, and D. Theobald, Fast determination of the optimal rotational matrix for macromolecular superpositions, Journal of computational chemistry, vol.31, issue.7, pp.1561-1563, 2010.

M. Levitt, Growth of novel protein structural data, PNAS, vol.104, pp.3183-3188, 2007.

C. Lawson, A. Patwardhan, M. L. Baker, C. Hryc et al., EMDataBank unified data resource for 3DEM.

D. Lee, O. Redfern, and C. Orengo, Predicting protein function from sequence and structure, Nature Reviews Molecular Cell Biology, vol.8, issue.12, pp.995-1005, 2007.

U. von Luxburg, Clustering Stability, 2010.

R. Laskowski, J. Watson, and J. Thornton, ProFunc: a server for predicting protein function from 3d structure, Nucleic acids research, vol.33, issue.2, pp.89-93, 2005.

B. Long, Z. Zhang, and P. Yu, Combining multiple clusterings by soft correspondence, IEEE Int'l Conf. on Data Mining, 2005.

A. Mitrophanov and M. Borodovsky, Statistical significance in biological sequence analysis, Briefings in Bioinformatics, vol.7, issue.1, pp.2-24, 2006.

V. Mariani, M. Biasini, A. Barbato, and T. Schwede, lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests, Bioinformatics, vol.29, issue.21, pp.2722-2728, 2013.

V. Maiorov and G. Crippen, Size-independent comparison of protein three-dimensional structures, Proteins: Structure, Function, and Bioinformatics, vol.22, issue.3, pp.273-283, 1995.

N. Malod-dognin, R. Andonov, and N. Yanev, Maximum clique in protein structure comparison, 9th International Symposium on Experimental Algorithms, pp.106-117, 2010.

G. Mayr, F. Domingues, and P. Lackner, Comparative analysis of protein structure alignments, BMC Structural Biology, vol.7, issue.1, p.50, 2007.

M. Meila, Comparing clusterings, 2002.

L. Meng, F. Sun, X. Zhang, and M. S. Waterman, Sequence alignment as hypothesis testing, Journal of computational biology, vol.18, issue.5, pp.677-691, 2011.

L. Nedialkova, M. Amat, I. Kevrekidis, and G. Hummer, Diffusion maps, clustering and fuzzy markov modeling in peptide folding transitions, The Journal of chemical physics, vol.141, issue.11, pp.9-611, 2014.

H. Nagamochi and T. Ibaraki, A fast algorithm for computing minimum 3-way and 4-way cuts, Mathematical Programming, vol.88, issue.3, pp.507-520, 2000.

E. Neveu, P. Popov, A. Hoffmann et al., RapidRMSD: Rapid determination of RMSDs corresponding to motions of flexible molecules.

S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, vol.48, issue.3, pp.443-453, 1970.

O. Carugo and S. Pongor, A normalized root-mean-square distance for comparing protein three-dimensional structures, Protein science, vol.10, issue.7, pp.1470-1473, 2001.

K. Olechnovič, E. Kulberkytė, and Č. Venclovas, CAD-score: A new contact area difference-based function for evaluation of protein structural models, Proteins: Structure, Function, and Bioinformatics, vol.81, issue.1, pp.149-162, 2013.

W. R. Pearson, Empirical statistical estimates for sequence similarity searches, Journal of molecular biology, vol.276, issue.1, pp.71-84, 1998.

M. Perutz, Stereochemistry of cooperative effects in haemoglobin, From theoretical physics to biology, pp.247-285, 1973.

J. Pevsner, Bioinformatics and functional genomics, 2015.

C. Pál, B. Papp, and M. Lercher, An integrated view of protein evolution, Nature Reviews Genetics, vol.7, issue.5, p.337, 2006.

G. A. Petsko and D. Ringe, Protein structure and function, 2008.

B. Phipson and G. K. Smyth, Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn, Statistical Applications in Genetics and Molecular Biology, vol.9, issue.1, 2010.

J. Pérez-vargas, T. Krey, C. Valansi, O. Ori, A. Avinoam et al., Structural basis of eukaryotic cell-cell fusion, Cell, vol.157, issue.2, pp.407-419, 2014.

C. H. Papadimitriou and M. Yannakakis, Optimization, approximation, and complexity classes, Journal of Computer and System Sciences, vol.43, issue.3, pp.425-440, 1991.

F. Rey et al., The envelope glycoprotein from tick-borne encephalitis virus at 2 Å resolution, Nature, p.291, 1995.

D. Ritchie, A. Ghoorah, L. Mavridis, and V. Venkatraman, Fast protein structure alignment using Gaussian overlap scoring of backbone peptide fragment similarity, Bioinformatics, vol.28, issue.24, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00756813

F. M. Richards, Areas, volumes, packing and protein structure, Ann. Rev. Biophys. Bioeng, vol.6, pp.151-176, 1977.

D. W. Ritchie, Calculating and scoring high quality multiple flexible protein structure alignments, Bioinformatics, p.300, 2016.

A. Rodriguez and A. Laio, Clustering by fast search and find of density peaks, Science, vol.344, issue.6191, pp.1492-1496, 2014.

M. G. Rossmann and V. B. Rao, Viral molecular machines, vol.726, 2011.

Y. Rubner, C. Tomasi, and L. J. Guibas, The earth mover's distance as a metric for image retrieval, International Journal of Computer Vision, vol.40, issue.2, pp.99-121, 2000.

A. Strehl and J. Ghosh, Cluster ensembles-a knowledge reuse framework for combining multiple partitions, Journal of machine learning research, vol.3, pp.583-617, 2002.

T. Shibuya, Efficient substructure RMSD query algorithms, Journal of Computational Biology, vol.14, issue.9, pp.1201-1207, 2007.

S. Simic, On a global upper bound for Jensen's inequality, Journal of Mathematical Analysis and Applications, vol.343, issue.1, pp.414-419, 2008.

F. Smith, E. Lattman, and C. Carter, The mutation β99 Asp-Tyr stabilizes Y: a new, composite quaternary state of human hemoglobin, Proteins: Structure, Function, and Bioinformatics, vol.10, issue.2, pp.81-91, 1991.

J. Söding, Protein homology detection by hmm-hmm comparison, Bioinformatics, vol.21, issue.7, pp.951-960, 2004.

N. Shibayama, K. Sugiyama, J. Tame, and S. Park, Capturing the hemoglobin allosteric transition in a single crystal form, Journal of the American Chemical Society, vol.136, issue.13, pp.5097-5105, 2014.

B. Steipe, A revised proof of the metric properties of optimally superimposed vector sets, Acta Crystallographica Section A: Foundations of Crystallography, vol.58, issue.5, pp.506-506, 2002.

H. Saran and V. Vazirani, Finding k-cuts within twice the optimal, SIAM J. Comp, vol.24, 1995.

T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology, vol.147, issue.1, pp.195-197, 1981.

F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus et al., Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega, Molecular Systems Biology, vol.7, issue.1, 2011.

The UniProt Consortium, UniProt: the universal protein knowledgebase, Nucleic Acids Research, vol.45, issue.D1, pp.158-169, 2017.

A. Topchy, A. K. Jain, and W. Punch, Clustering ensembles: Models of consensus and weak partitions, IEEE transactions on pattern analysis and machine intelligence, vol.27, pp.1866-1881, 2005.

R. Tibshirani, G. Walther, and T. Hastie, Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.63, issue.2, 2001.

C. L. Tang, L. Xie et al., On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles, pp.1043-1062.

S. Umeyama, Least-squares estimation of transformation parameters between two point patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.13, issue.4, pp.376-380, 1991.

U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing, vol.17, issue.4, pp.395-416, 2007.

J. M. White, S. E. Delos, M. Brecher, and K. Schornberg, Structures and mechanisms of viral membrane fusion proteins: multiple variations on a common theme, Critical reviews in biochemistry and molecular biology, vol.43, issue.3, pp.189-219, 2008.

W. Weissenhorn, A. Hinz, and Y. Gaudin, Virus membrane fusion, FEBS letters, vol.581, issue.11, pp.2150-2155, 2007.

L. Wang and T. Jiang, On the complexity of multiple sequence alignment, Journal of Computational Biology, vol.1, issue.4, p.8790475, 1994.

J. Watson, R. Laskowski, and J. Thornton, Predicting protein function from sequence and structural data, Current opinion in structural biology, vol.15, issue.3, pp.275-284, 2005.
DOI : 10.1016/j.sbi.2005.04.003

I. Wohlers, N. Malod-dognin, R. Andonov, and G. Klau, CSA: comprehensive comparison of pairwise protein structure alignments, Nucleic acids research, vol.40, issue.W1, pp.303-309, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00667920

Z. Xiang, Advances in homology protein structure modeling, Current Protein and Peptide Science, vol.7, issue.3, pp.217-227, 2006.

R. Xu and D. Wunsch, Survey of clustering algorithms, IEEE Transactions on neural networks, vol.16, issue.3, pp.645-678, 2005.

Y. Ye and A. Godzik, Flexible structure alignment by chaining aligned fragment pairs allowing twists, Bioinformatics, vol.19, issue.2, pp.246-255, 2003.
DOI : 10.1093/bioinformatics/btg1086
URL : https://academic.oup.com/bioinformatics/article-pdf/19/suppl_2/ii246/435771/btg1086.pdf

Y. Yu and T. Hwa, Statistical significance of probabilistic sequence alignment and related local hidden markov models, Journal of Computational Biology, vol.8, issue.3, pp.249-282, 2001.

A. Zemla, LGA: a method for finding 3D similarities in protein structures, Nucleic acids research, vol.31, issue.13, pp.3370-3374, 2003.

D. Zhou, J. Li, and H. Zha, A new mallows distance based metric for comparing clusterings, ICML, pp.1028-1035, 2005.

L. Zhao, H. Nagamochi, and T. Ibaraki, Approximating the Minimum k-way Cut in a Graph via Minimum 3-way Cuts, 1999.

Y. Zhang and J. Skolnick, TM-align: a protein structure alignment algorithm based on the tm-score, Nucleic acids research, vol.33, issue.7, pp.2302-2309, 2005.

Table B.2 Statistical significance of our motifs, when compared against random motifs with two non-parametric two-sample tests. Second column: p-value for the Wilcoxon Mann-Whitney U test
Row labels (values not recoverable): SFV-Alpha vs. DFV-Flavi; SFV-Alpha vs. RBV-Rubi; SFV-Alpha vs. RVFV-Phlebo

Method-executable correspondence. When qualified by the suffix iter

Method              SBL executable                             Option
Align-Apurva-SFD    sbl-structural-motifs-chains-apurva.exe
Align-Apurva-CD     sbl-structural-motifs-chains-apurva.exe    -use-cd-filtration
Align-Kpax-SFD      sbl-structural-motifs-chains-kpax.exe
Align-Kpax-CD       sbl-structural-motifs-chains-kpax.exe      -use-cd-filtration
Align-Identity-SFD  sbl-structural-motifs-conformations

This section is devoted to the proof of Theorem 6.1. For the sake of readability, we split this proof into three parts: Theorems D.2, D.5 and D.6. Notice that the last two proofs are quite similar.

We say that $\Pi$ L-reduces to $\Pi'$ if there are two polynomial-time algorithms $f$, $g$ and constants $\alpha, \beta > 0$ such that for each instance $I$ of $\Pi$:
1. Algorithm $f$ produces an instance $I' = f(I)$ of $\Pi'$ such that the optima of $I$ and $I'$ satisfy $OPT_{\Pi'}(I') \le \alpha \cdot OPT_{\Pi}(I)$.
2. Given any solution of $I'$ with cost $c'$, algorithm $g$ produces a solution of $I$ with cost $c$ such that $OPT_{\Pi}(I) - c \le \beta \, (OPT_{\Pi'}(I') - c')$.

It is known that if $\Pi$ is APX-hard and L-reduces to $\Pi'$, then $\Pi'$ is APX-hard as well. In that case, $\Pi'$ does not admit a PTAS (Polynomial Time Approximation Scheme) unless P = NP.

For any $D \ge 2$, the D-family-matching problem is APX-hard even if the maximum degree $\Delta$ is at most 4 and the weights are 2 and 5. In our reduction, we use a special case of the set packing problem.

Given a collection of sets $Y_i$ and an integer $k \ge 1$, the set packing problem consists in determining whether there exists a packing $C$ (a sub-collection of pairwise disjoint sets) of size $|C| = k$. The set packing problem is NP-complete even if $|Y_i| = 3$ for every $i$.
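To make the decision problem concrete, here is a minimal brute-force sketch, assuming a packing is a sub-collection of pairwise disjoint sets. The function name `has_set_packing` and the example sets are hypothetical. The enumeration is exponential in general, which is consistent with the NP-completeness stated above, so it is only meant for tiny instances.

```python
from itertools import combinations

def has_set_packing(sets, k):
    """Return True if k pairwise disjoint sets can be chosen from `sets`."""
    for combo in combinations(sets, k):
        seen = set()
        disjoint = True
        for s in combo:
            if seen & s:          # s shares an element with an earlier set
                disjoint = False
                break
            seen |= s
        if disjoint:
            return True
    return False

# Hypothetical instance with |Y_i| = 3 for every i.
Y = [{1, 2, 3}, {3, 4, 5}, {6, 7, 8}]
print(has_set_packing(Y, 2))  # {1,2,3} and {6,7,8} are disjoint
print(has_set_packing(Y, 3))  # the first two sets share element 3
```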

By Theorem 6.2, given $D \ge 1$, there is an $O(D^2 n)$-time complexity algorithm for the D-family-matching problem because $\Delta = 2$. We prove in Lemma D.3 a better time complexity algorithm for the D-family-matching problem.

Lemma D.3 (D-family-matching for paths). Let $D \in \mathbb{N}^+$. Consider any intersection graph $G = (V, E, w)$ that is a path. Then, there exists an $O(Dn)$-time complexity algorithm for the D-family-matching problem for $G$.

Proof of Lemma D.3. Let $V = \{v_1, \ldots, v_n\}$ and let $E = \{\{v_j, v_{j+1}\} \mid 1 \le j \le n-1\}$. We define the function $\Phi_D$ as follows. For every $t \in \{1, \ldots, n\}$ and every $i \in \{\max(1, t-D), \ldots, t+1\}$, $\Phi_D(v_t, i)$ is the score of an optimal solution $S$ of the D-family-matching problem for the sub-path induced by the set of nodes $\{v_1, \ldots, v_t\}$.

Claim. For every $i$ in the admissible range up to $t$, consider an optimal solution such that $\{v_i, \ldots, v_t\}$ is a set of this solution. We then modify this solution by adding node $v_{t+1}$ to the last set, and we obtain the optimal solution for the D-family-matching problem for the corresponding sub-path.

In the remaining case, any solution must contain the set $\{v_{t+1}\}$. Thus, we have to consider an optimal solution for the D-family-matching problem for the sub-path induced by the set of nodes $\{v_i, \ldots, v_t\}$. We now prove the result for $\Phi_D(v_{t+1}, t+2)$.
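The $O(Dn)$ bound for paths can be illustrated with a short dynamic program. This is a sketch under stated assumptions, not the authors' algorithm: it assumes that a solution partitions the path into contiguous sub-paths of at most $D$ edges (diameter at most $D$) and that the score is the sum of the weights of the edges lying inside the chosen sets. The function name `family_matching_path` and the example weights are hypothetical.

```python
def family_matching_path(weights, D):
    """Best score of a partition of a path into blocks of diameter <= D.

    weights[j] is the weight of edge {v_{j+1}, v_{j+2}} (0-indexed list),
    so the path has n = len(weights) + 1 nodes.
    Assumption: the score is the total weight of edges inside the blocks.
    """
    n = len(weights) + 1
    # prefix[m] = total weight of the first m edges
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    # best[t] = optimal score for the sub-path v_1..v_t (best[0] = 0)
    best = [0] * (n + 1)
    for t in range(1, n + 1):
        candidates = []
        # last block is {v_i, ..., v_t}; it has t - i <= D edges
        for i in range(max(1, t - D), t + 1):
            inside = prefix[t - 1] - prefix[i - 1]
            candidates.append(best[i - 1] + inside)
        best[t] = max(candidates)
    return best[n]

# Hypothetical path with 4 nodes and edge weights 5, 1, 4.
print(family_matching_path([5, 1, 4], 1))
print(family_matching_path([5, 1, 4], 3))
```

Each node considers at most $D + 1$ candidate blocks and the prefix sums give each block weight in constant time, hence $O(Dn)$ operations overall.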

Let $D \in \mathbb{N}^+$. Consider any intersection graph $G = (V, E, w)$ that is an even cycle. Then, there exists an $O(D^2 n)$-time complexity algorithm for the D-family-matching problem for $G$.

Consider any instance of the D-family-matching problem such that:
- for every $i \in \{1, \ldots, r\}$, there exist $j_1, j_2 \in \{1, \ldots, r'\}$ such that $F_i \cap F'_j = \emptyset$ for any $j \in \{1, \ldots, r'\} \setminus \{j_1, j_2\}$;
- for every $j \in \{1, \ldots, r'\}$, there exist $i_1, i_2 \in \{1, \ldots, r\}$ such that $F'_j \cap F_i = \emptyset$ for any $i \in \{1, \ldots, r\} \setminus \{i_1, i_2\}$.
Then, there exists an $O((r + r') D^2)$-time complexity algorithm for the D-family-matching problem. Said otherwise, Corollary D.2 shows that there is a polynomial time algorithm for the D-family-matching problem if any set in $F \cup F'$ has a non-empty intersection with at most two other sets of $F \cup F'$.

D.6 Appendix: Generic approach based on spanning trees

Let us first introduce some notations. For every $v \in V$, let $\mathcal{H}(G, v)$ be the set of all different sub-graphs of $G$ that contain $v$ and of diameter at most $D$. Let $\mathcal{H}(G) = \cup_{v \in V} \mathcal{H}(G, v)$. Let $T_r$ be any spanning tree of $G$ rooted at node $r \in V$. For every $v \in V$, we define $\mathcal{H}(G, T_r, v)$ as the set of all $H \in \mathcal{H}(G, v)$ such that the graph induced by the set of nodes $V(H) \cap V(T_v)$ is a (connected) sub-tree rooted at $v$. Let $\mathcal{H}(G, T_r) = \cup_{v \in V} \mathcal{H}(G, T_r, v)$.

A leaf is a node of degree one that is different from the root $r$. Let $N(v) = \{v_1, \ldots, v_q\}$ be the set of $q \ge 1$ neighbors of $v$ in $T_v$. Suppose we have computed $\Phi_D(v_j, H)$ for every $j \in \{1, \ldots, q\}$.

Algorithms based on spanning trees

Proof of Lemma 6.5. For some $k \ge 1$, consider an optimal solution $S = \{S_1, \ldots, S_k\}$ for the D-family-matching problem for $G$. For every $i \in \{1, \ldots, k\}$, let $T_i$ be a spanning tree of $G[S_i]$. Let $T$ be any rooted spanning tree of $G$ such that $E(T_i) \subseteq E(T)$ for every $i \in \{1, \ldots, k\}$. By construction of $T$, $S$ is an admissible solution for the D-family-matching problem for $G$.

Given any positive integer $D \ge 1$ and any intersection graph $G$, Algorithm 1 returns $\Phi_D(G)$, that is an optimal solution for the D-family-matching problem for $G$.

, Furthermore, the time complexity of Algorithm 1 is O(|T (G)| max Tr?T (G) h(G, T r ) ? n)

Lemma. Let $G$ be any intersection graph. Then, there exists a rooted spanning tree $T$ of $G$

For some $k \ge 1$, consider an optimal solution $S = \{S_1, \ldots, S_k\}$ for the 2-family-matching problem for $G$. For every $i \in \{1, \ldots, k\}$, let $T_i$ be a spanning tree of $G[S_i]$. Let $T$ be any rooted spanning tree of $G$ such that $E(T_i) \subseteq E(T)$ for every $i \in \{1, \ldots, k\}$. Indeed, since $D = 2$, $G[S_i]$ is necessarily a complete bipartite graph and its number of nodes is at most $2\Delta$. It is sufficient to select the maximum star as $T_i$.

?. , ?. R(g,-?)-=-t-?-,-where-t-(g)-=-{t-1, ,. .. , T. |t-(g)|-}, A. {1 et al., D) returns ? D (T ? ) (Theorem 6.2). prove the result by induction. Clearly, ? 1,?1(G),?1(G) (1) = 0. Assume that we have computed ? y,x ? ,x + (D) for every D ?, Given any intersection graph G, Algorithm 1 returns a 2?-approximation for the 2-familymatching problem for G if: ? ?(M)

?. Consider-first-the-case-?-d+1-(g)-?]x and ?. , We necessarily have ? y,x ? ,x + (D +1) = ? y,x ? ,x + (D) because we cannot start a new plateau since x ? < ? D+1 (G) < x +

?. Assume-that-?-d+1-(g)-=-x-?-and-x-?-<-x-+, We cannot start a new plateau because x ? < x +. Thus we have to find the best y plateaus such that the lower bound is at least ? D+1 (G) = x ? and at most x +. We get that ? y,x ? ,x + (D + 1) = min x?P D+1

?. If and ?. D+1-(g)-=-x-+-and-x-?-<-x-+,

?. Consider-the-case-x-?-=-x-+-=-?-d+1, In the second case, the score is minimum score among all the optimal solutions composed of y ?1 plateaus

?. Thus, D + 1) is the minimum among these two scores

, D+1 (G) < x ? or ? D+1 (G) > x + , then there is no admissible solution and, by convention, ? If ?

?. {0, .. .. {1, .. .. Every-x-?-,-x-+-?-p-d+1-with-x-?-?-x-+-;-{0, .. .. {1, and .. .. , There are O(D 4 G ) such computations. All the cases (but the fourth), can be calculated in O(D G ) time. Thus, we get the O(D 5 G )-time complexity. Now consider the fourth case in which x ? = x +. Thus, for every D ?