A nearest-neighbours kernel for classification : a case study of in situ two-dimensional plankton images with correction of total volume estimates for copepods

Cédric Dubois

Abstract

Plankton organisms are a key component of the biosphere: they are at the base of marine food webs and are important contributors to biogeochemical cycles, notably those of carbon, nitrogen and oxygen. Phytoplankton captures carbon dioxide from the atmosphere and produces dioxygen; zooplankton contributes to aggregating and exporting this carbon to depth, where it is sequestered for hundreds of years. Ecologists study this so-called "biological carbon pump" to estimate its efficiency today and in the future, in response to climate change. A modern approach consists in studying how the environment is linked with the functioning of ecosystems through "traits" (i.e., individual characteristics) of organisms. For example, a high correlation has been observed between the size distribution of zooplankters and carbon sequestration efficiency. In situ imaging instruments and large image databases have been built for plankton, allowing taxonomic classification of organisms and quantification of the total volume of each group based on their morphology. The development of automated classification methods has been essential to help ecologists process these data. Among them, Artificial Neural Networks (ANNs) have proven to be efficient and accurate, but their decisions are often hard to interpret. On the one hand, in this thesis, we put forward the idea that following the transform-then-classify-simply approach of ANNs with a simple, explicit transform can result in a classifier whose predictions are both interpretable (and thus trustworthy) and accurate. The proposed transform is defined as a linear combination of optimal, per-class targets, and the classification is performed, as with ANNs, by a nearest-target decision.
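The transform-then-classify idea can be illustrated with a minimal sketch. It assumes one-hot class targets and inverse-distance weights over the k nearest training samples; both choices, and the function names `wknn_transform` and `nearest_target`, are illustrative assumptions, not the definitions used in the thesis.

```python
import numpy as np

def wknn_transform(X_train, y_train, X_query, n_classes, k=5, eps=1e-12):
    """Map each query to a weighted linear combination of per-class targets."""
    targets = np.eye(n_classes)  # assumed one-hot targets, one per class
    # Pairwise Euclidean distances, shape (n_query, n_train)
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]           # indices of the k nearest samples
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)  # assumed weighting
    w /= w.sum(axis=1, keepdims=True)            # weights sum to 1 per query
    # Combine the targets of the neighbours' classes, shape (n_query, n_classes)
    return np.einsum("qk,qkc->qc", w, targets[y_train[idx]])

def nearest_target(Z, n_classes):
    """Classify each transformed point by its closest class target."""
    targets = np.eye(n_classes)
    d = np.linalg.norm(Z[:, None, :] - targets[None, :, :], axis=2)
    return np.argmin(d, axis=1)

# Toy usage with two well-separated classes
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_query = np.array([[0.1, 0.2], [5.2, 5.1]])
Z = wknn_transform(X_train, y_train, X_query, n_classes=2, k=3)
print(nearest_target(Z, n_classes=2))
```

With one-hot targets, the nearest target to a transformed point is the class with the largest accumulated weight, so this sketch reduces to a distance-weighted k-nearest-neighbour vote. The transformed vector itself shows how much each class contributed to the decision.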
Furthermore, as a main theoretical result, we establish that the proposed transform defines a kernel associated with the Weighted-k-Nearest-Neighbor (W-kNN) classifier, and allows interpreting the W-kNN classifier as a member of a larger family of target-based classifiers, which satisfies an optimality criterion. We propose a modern W-kNN implementation that is computationally efficient enough to deal with large datasets, like the ones collected every day by plankton imaging instruments. We were therefore able to perform a leave-one-out cross-validation on large plankton image datasets. On the other hand, we tackle the correction of copepod volume estimates obtained from two-dimensional in situ images. Copepods are the most abundant zooplankton group and represent a significant share of the biomass of animals on Earth. The standard volume estimation methods are biased by the effect of the projection onto the image plane. Two such methods exist: one based on the Equivalent Spherical Diameter (ESD) and one based on extending the best-fitting ellipse to 3D. We present a procedure for correcting the total volume estimates of both methods for this zooplankton group. First, the projection of the body of the copepod is robustly extracted. Second, we note that the exact projection of an ellipsoidal body model onto the image plane is an ellipse. Therefore, by simulating many realistic ellipsoids (relying on shape distributions established from manual size measurements on a dataset) and their projections from random points of view, we can compute a total volume correction factor for each standard method. Unlike a new method for estimating volume from the images, the proposed correction factors make it possible to improve the estimates of past studies, while remaining applicable to future studies.
To validate the proposed method, we applied it to a database of 150,000 images of copepods captured by the Underwater Vision Profiler (UVP), and found that the corrections decreased the gap between the two standard methods by a factor of 50. The correction factors indicated that the ESD method tends to overestimate the total volume by around 20% and that the ellipse method tends to underestimate it by around 10%.
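The simulation behind the correction factors can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the thesis: the function name `correction_factors` is hypothetical, the ellipse method is assumed to extend the fitted ellipse to a spheroid whose third semi-axis equals the minor one, and the shape distribution in the usage lines is invented for the example rather than taken from manual measurements.

```python
import numpy as np

def correction_factors(a, b, c, rng):
    """Total-volume correction factors (true / estimated) for the ESD and
    ellipse methods, for ellipsoids with semi-axes a, b, c seen from
    random directions."""
    n = a.size
    true_vol = 4.0 / 3.0 * np.pi * a * b * c

    # Viewing directions uniform on the sphere
    u = rng.normal(size=(n, 3))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # Orthonormal basis (e1, e2) of the image plane perpendicular to u
    helper = np.where(np.abs(u[:, [0]]) < 0.9, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
    e1 = np.cross(u, helper)
    e1 /= np.linalg.norm(e1, axis=1, keepdims=True)
    e2 = np.cross(u, e1)
    P = np.stack([e1, e2], axis=1)                    # shape (n, 2, 3)

    # The orthogonal projection of an ellipsoid with shape matrix
    # diag(a^2, b^2, c^2) is an ellipse with shape matrix P diag(...) P^T
    D = np.stack([a**2, b**2, c**2], axis=1)          # shape (n, 3)
    S = np.einsum("nij,nj,nkj->nik", P, D, P)         # shape (n, 2, 2)
    eig = np.linalg.eigvalsh(S)                       # ascending eigenvalues
    minor, major = np.sqrt(eig[:, 0]), np.sqrt(eig[:, 1])

    area = np.pi * major * minor
    esd_vol = 4.0 / 3.0 * np.pi * (area / np.pi) ** 1.5    # sphere of equal area
    ell_vol = 4.0 / 3.0 * np.pi * major * minor**2         # assumed spheroid extension
    return true_vol.sum() / esd_vol.sum(), true_vol.sum() / ell_vol.sum()

# Hypothetical elongated shapes (illustrative distribution only)
rng = np.random.default_rng(0)
a = rng.lognormal(mean=0.0, sigma=0.3, size=20000)
b = a * rng.uniform(0.3, 0.5, size=a.size)
c = b * rng.uniform(0.8, 1.0, size=a.size)
f_esd, f_ell = correction_factors(a, b, c, rng)
print(f"ESD factor: {f_esd:.3f}, ellipse factor: {f_ell:.3f}")
```

A factor below 1 means the method overestimates the total volume, and a factor above 1 means it underestimates it. Because the factors are ratios of totals, they can be applied to total volumes already published with either standard method. The values printed here depend entirely on the invented shape distribution and should not be compared with the figures reported above.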

The organisms that make up plankton are essential elements of the biosphere: at the base of the marine food chain, they are central to biogeochemical cycles, notably those of carbon, nitrogen and oxygen. Phytoplankton captures carbon dioxide from the atmosphere and produces dioxygen; zooplankton contributes to exporting this carbon to depth. Ecologists study this "biological carbon pump" in order to assess its current and future efficiency in the face of climate change. A modern approach consists in studying how the environment is linked to the functioning of ecosystems through the "traits" (individual characteristics) of organisms. A strong correlation has been observed between the size distribution of zooplankton and the efficiency of carbon sequestration. In situ imaging instruments and large image datasets have been set up for plankton, enabling the taxonomic classification of organisms and the quantification of the total volume per group. The development of automated classification methods has been essential in assisting data processing. In this respect, Artificial Neural Networks (ANNs) have proven very useful and accurate, but their decisions are often difficult to interpret. First, we show that the transform-then-classify-simply approach of ANNs, used with a simple and explicit transform, leads to a classification method whose predictions are interpretable (and therefore reliable) and accurate. The proposed transform is defined as a linear combination of per-class targets. Classification is then performed, as with ANNs, by taking the nearest target. Our main result shows that this transform defines a kernel associated with the Weighted-k-Nearest-Neighbor (W-kNN) classifier. This makes it possible to interpret W-kNN as a member of a larger family of target-based classifiers, which satisfies an optimality criterion. We propose a modern W-kNN implementation that is efficient enough to process large datasets, such as those collected every day by plankton imaging instruments. We were thus able to perform a leave-one-out cross-validation on large plankton image datasets. Second, we study the estimation of copepod volume from two-dimensional in situ images. Copepods are the most abundant zooplankton group. The two standard volume estimation methods are biased by the effect of the projection onto the image plane. One uses the Equivalent Spherical Diameter (ESD) and the other the fitting of an ellipse. We present a procedure for correcting the total volume estimates of both methods for this group. Only the projection of the body of the copepod is extracted. We further observe that the exact projection of an ellipsoid onto the plane is an ellipse. Consequently, by simulating many realistic ellipsoids (based on manual size measurements) and their projections under a random orientation, we compute one total volume correction factor per method. Unlike a new estimation method, the proposed corrections make it possible to improve the estimates of past studies while remaining applicable to future ones. As a validation, we apply the correction procedure to the total volume estimates of 150,000 copepods from images taken by an in situ instrument. The correction factors reduce the gap between the two estimates by a factor of 50, and indicate that the ESD method tends to overestimate the total volume by around 20% and that the ellipse-based method tends to underestimate it by around 10%.

Un noyau des plus proches voisins pour la classification : application aux images de plancton bidimensionnelles in situ avec correction des estimations de volume total pour les copépodes
