The novelty consists in choosing the sub-partition of the hierarchical clustering that maximizes the size of the M-th cluster. Observations belonging to clusters smaller than this one are treated as outliers. In particular, this approach prevents observations that are isolated in space from forming clusters of their own, a criticism often made of the single linkage algorithm. It is in this sense that we call the algorithm Robust single linkage clustering. To simplify the exposition, [...]
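The selection rule described above can be illustrated with a minimal sketch. It is not the author's implementation and rests on several assumptions: the "M-th cluster" is read as the M-th largest cluster, the candidate sub-partitions are taken to be the flat cuts of the single linkage dendrogram into k = M, ..., n clusters, and ties are resolved in favour of the smallest k. The function name `robust_single_linkage` is hypothetical; `linkage` and `fcluster` are SciPy functions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def robust_single_linkage(X, M):
    """Cut the single linkage tree where the M-th largest cluster is biggest.

    Returns one label per observation: 0, 1, ... for retained clusters
    (largest first) and -1 for observations flagged as outliers.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z = linkage(X, method="single")  # single linkage dendrogram

    best_size, best_flat = -1, None
    # Assumption: candidates are the flat cuts into k = M, ..., n clusters.
    for k in range(M, n + 1):
        flat = fcluster(Z, t=k, criterion="maxclust")
        sizes = np.sort(np.bincount(flat)[1:])[::-1]  # sizes, largest first
        if len(sizes) < M:
            continue  # tied merge heights can give fewer than M clusters
        if sizes[M - 1] > best_size:  # strict: ties keep the smallest k
            best_size, best_flat = sizes[M - 1], flat

    if best_flat is None:
        raise ValueError("no cut of the dendrogram yields M clusters")

    # Clusters smaller than the M-th largest one are treated as outliers.
    counts = np.bincount(best_flat)
    order = np.argsort(-counts, kind="stable")
    kept = [c for c in order if counts[c] >= best_size]
    labels = np.full(n, -1)
    for new_id, old_id in enumerate(kept):
        labels[best_flat == old_id] = new_id
    return labels
```

With two well-separated groups and one distant isolated point, calling `robust_single_linkage(X, M=2)` retains the two groups and assigns the label -1 to the isolated point. Because the sketch follows the text literally, clusters tied in size with the M-th largest are all retained, so more than M clusters can be returned in that case.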

In [...], we also study the rate at which the clustering risk (1.6) tends to 0 under certain assumptions concerning:
- the separability and regularity of the supports,
- the sparsity of the model, that is, the ratio between the density of observations in the supports S_i, i = 1, [...]

The proposed approach is also compared with classical clustering methods (k-means [...], spectral clustering) on various simulation scenarios. Figure 1.5 shows the results of 4 approaches on the data of Figure 1.4. In this example, classical single linkage clearly identifies two very small clusters and places all remaining observations in a single cluster. The other approaches behave better, with a preference for spectral clustering and the approach we propose, which [...]
