Développements méthodologiques autour de l'analyse des données de metabarcoding ADN (Methodological developments around the analysis of DNA metabarcoding data)

Abstract : This thesis is set in the context of high-throughput sequencing data processing, and specifically of DNA metabarcoding data. DNA metabarcoding consists of identifying taxa or taxonomic groups from DNA extracted from environmental samples (water, soil, animal feces). After DNA extraction, short sequences used as taxonomic markers are amplified by PCR, then sequenced using high-throughput sequencing technologies. Large volumes of data are produced this way, usually from several thousand to several hundred thousand sequences per sample. This thesis aimed to develop methods for the analysis of these sequences. Classification methods address many problems in DNA metabarcoding. Supervised classification is used for the taxonomic assignment of sequences, by comparing them to the sequences of a reference database. Unsupervised classification methods are used to group sequences into molecular operational taxonomic units (MOTUs) in order to estimate biodiversity. They are also used to identify erroneous sequences, particularly those generated during the PCR and sequencing steps, which often derive from true sequences and remain very close to them. Classification approaches used in DNA metabarcoding require a sequence comparison method that is both fast and exact. Such a method was developed. It uses a Needleman-Wunsch type global alignment algorithm that computes the length of the longest common subsequence of the two sequences being aligned, combined with a lossless filter that skips the alignment of pairs of sequences that cannot reach a chosen similarity threshold. Single Instruction, Multiple Data (SIMD) instructions and multithreading speed up the calculations.
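The comparison method described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the similarity definition (LCS length divided by the length of the longer sequence), the function names, and the filter itself are assumptions made for illustration. The filter shown here relies on the fact that the LCS cannot be longer than the number of symbols the two sequences share, so a pair whose bound falls below the threshold can be skipped without losing any valid result. The SIMD and multithreading optimisations are not reproduced.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer one so the DP row stays short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_upper_bound(a: str, b: str) -> int:
    """Cheap upper bound on the LCS length: number of shared symbols.

    Illustrative filter only (an assumption, not the filter of the thesis).
    """
    counts_a, counts_b = Counter(a), Counter(b)
    return sum(min(n, counts_b[sym]) for sym, n in counts_a.items())


def similarity(a: str, b: str, threshold: float = 0.0):
    """Return LCS length / longer length, or None if below the threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    # Lossless filter: the bound never underestimates the true LCS length,
    # so a pair rejected here could not have reached the threshold.
    if lcs_upper_bound(a, b) / longest < threshold:
        return None
    score = lcs_length(a, b) / longest
    return score if score >= threshold else None
```

With these definitions, similarity("ACGT", "ACGA", 0.5) runs the alignment, whereas similarity("AAAA", "CCCC", 0.5) is rejected by the filter before any alignment is computed.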
This comparison method is implemented in SUMATRA, a program that computes all pairwise similarities within a dataset or between two datasets, with an optional threshold below which similarities are ignored. It is also used in SUMACLUST, a program that groups sequences with a star clustering algorithm in which each cluster has a representative sequence. This algorithm can be used to generate MOTUs, or to identify erroneous sequences deriving from true sequences, since true sequences tend to end up as the representative sequences of their clusters. The more specialized SUMACLEAN program was developed to identify sequences containing point errors introduced during PCR. To that end, directed acyclic graphs are built whose topology exactly matches the successions of errors generated by point errors during PCR. A new supervised classification approach for the taxonomic assignment of sequences was also studied. Currently, most taxonomic assignment approaches use methods that are poorly suited to the high polymorphism of markers, and do not sufficiently account for the incompleteness and errors inherent to reference databases. The approach tested starts from the root of the taxonomic tree and descends it, possibly stopping before reaching a leaf if descending to a more precise taxonomic level seems unreasonable. In theory, this approach would handle the problems inherent to reference databases better, but it raises several issues, such as how to represent sequences at intermediate levels of the tree and which model to use when choosing a path through the tree, for which no satisfactory solutions have been found yet.
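The star clustering idea mentioned above, where each cluster is built around one representative sequence, can be sketched as follows. This is a hedged illustration and not the SUMACLUST code: the greedy strategy, the ordering of sequences by decreasing abundance, the names star_cluster and identity, and the position-wise stand-in similarity are all assumptions. In practice an alignment-based similarity would be passed in instead of identity.

```python
from typing import Callable, Dict, List


def identity(a: str, b: str) -> float:
    """Stand-in similarity: matching positions / longer length (no alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def star_cluster(
    counts: Dict[str, int],
    threshold: float,
    similarity: Callable[[str, str], float] = identity,
) -> Dict[str, List[str]]:
    """Greedy star clustering: map each representative to its members."""
    clusters: Dict[str, List[str]] = {}
    # Assumption: sequences are visited from most to least abundant, so that
    # abundant (presumably true) sequences become cluster representatives.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    for seq in ordered:
        for center in clusters:
            # Attach to the first representative that is similar enough.
            if similarity(seq, center) >= threshold:
                clusters[center].append(seq)
                break
        else:
            # No representative is close enough: seq starts a new cluster.
            clusters[seq] = [seq]
    return clusters
```

For example, with counts {"ACGTACGT": 100, "ACGTACGA": 3, "TTTTCCCC": 50, "TTTTCCCA": 2} and a threshold of 0.8, the two abundant sequences become representatives and each rare variant joins the representative it differs from by one position.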
Contributor : Abes Star
Submitted on : Tuesday, January 16, 2018 - 3:38:10 PM
Last modification on : Friday, July 17, 2020 - 8:26:03 AM
Long-term archiving on : Tuesday, May 8, 2018 - 1:14:09 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01685615, version 1



Celine Mercier. Développements méthodologiques autour de l'analyse des données de metabarcoding ADN. Génétique des plantes. Université Grenoble Alpes, 2015. Français. ⟨NNT : 2015GREAV060⟩. ⟨tel-01685615⟩


