
Motif extraction from complex data : case of protein classification

Abstract : The classification of biological data is one of the significant challenges in bioinformatics, for protein as well as nucleic acid data. The sheer volume of these data, their ambiguity and especially the high cost of in vitro analysis in terms of time and resources make the use of data mining more a necessity than a rational choice. However, data mining techniques, which often process data in relational format, are confronted with the inappropriate format of biological data. Hence, a pre-processing step is unavoidable.

This thesis deals with the pre-processing of protein data as a preparation step before their classification. We present motif extraction as a reliable way to address that task. The extracted motifs are used as descriptors to encode proteins into feature vectors. This enables the use of known data mining classifiers, which require this format. However, designing a suitable feature space for a set of proteins is not a trivial task.

We deal with two kinds of protein data, i.e., sequences and three-dimensional structures. In the first axis, i.e., protein sequences, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. We demonstrate the efficiency of this approach by comparing it with several encoding methods, using several classifiers. We also propose new metrics to study the robustness of some of these methods when the input data are perturbed. These metrics measure the ability of a method to reveal any change occurring in the input data, and also its ability to target the interesting motifs.

The second axis is dedicated to 3D protein structures, which have recently come to be viewed as graphs of amino acids. We give a brief survey of the most widely used graph-based representations and propose a naïve method to help build protein graphs. We show that some existing and widespread methods have notable weaknesses and do not really reflect the real protein conformation. In addition, we are interested in discovering recurrent sub-structures in proteins, which can give important functional and structural insights. We propose a novel algorithm to find spatial motifs in proteins. The extracted motifs match a well-defined shape that is proposed on a biological basis. We compare them with the sequential motifs and spatial motifs of recent related works.

For all our contributions, the outcomes of the experiments confirm the efficiency of our proposed methods to represent both protein sequences and protein 3D structures in classification tasks. Software programs developed during this research work are available on my home page.
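As an illustration of the general idea of encoding proteins into feature vectors from motifs, here is a minimal Python sketch. It is not the method proposed in the thesis: it uses exact substring matching rather than substitution-matrix similarity, and the function name `encode`, the motif list, and the sequences are hypothetical values chosen for the example.

```python
from typing import Dict, List


def encode(sequence: str, motifs: List[str]) -> List[int]:
    """Return a binary vector: 1 if the motif occurs in the sequence, else 0.

    Assumption: a motif "occurs" when it is an exact substring. The thesis
    instead defines motif similarity via amino-acid substitution matrices,
    which this sketch does not reproduce.
    """
    return [1 if motif in sequence else 0 for motif in motifs]


# Hypothetical motifs and protein sequences, for illustration only.
motifs = ["ACD", "KLM", "GGW"]
proteins: Dict[str, str] = {
    "p1": "MACDEKLMN",
    "p2": "GGWACDPQ",
}

# One fixed-length feature vector per protein, one position per motif.
vectors = {name: encode(seq, motifs) for name, seq in proteins.items()}
print(vectors)
```

Each protein becomes a fixed-length vector with one position per motif, which is the relational format that standard classifiers expect.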

Cited literature : 196 references
Contributor : Abes Star
Submitted on : Monday, March 11, 2019 - 9:23:05 AM
Last modification on : Monday, January 18, 2021 - 10:34:38 PM
Long-term archiving on : Wednesday, June 12, 2019 - 12:51:13 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02063250, version 1


Rabie Saidi. Motif extraction from complex data : case of protein classification. Bioinformatics [q-bio.QM]. Université Blaise Pascal - Clermont-Ferrand II, 2012. English. ⟨NNT : 2012CLF22272⟩. ⟨tel-02063250⟩


