The novelty consists in choosing the sub-partition of the hierarchical clustering that maximizes the size of the M-th cluster. Observations that belong to clusters smaller than this one are treated as outliers. In particular, this approach prevents observations that are isolated in space from forming clusters of their own, a criticism often levelled at the single linkage algorithm. It is in this sense that we call this algorithm Robust single linkage clustering. To simplify the exposition, [...]
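The selection rule described above can be sketched in a few lines. This is a minimal illustration and not the author's implementation: the function name `robust_single_linkage`, the Euclidean distance, the union-find construction of the single linkage hierarchy, and the tie-breaking choices (the first level reaching the maximum is kept, and exactly the M largest clusters are retained) are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def robust_single_linkage(points, M):
    """Label each point 0..M-1 (one of the M largest clusters) or -1 (outlier).

    Hypothetical sketch: scan every level of the single linkage hierarchy and
    keep the partition whose M-th largest cluster is as large as possible.
    """
    n = len(points)
    if not 1 <= M <= n:
        raise ValueError("M must be between 1 and the number of points")
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def partition():
        # Current clusters as lists of point indices, largest first.
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        return sorted(groups.values(), key=len, reverse=True)

    # Single linkage merges clusters in order of increasing pairwise distance
    # (equal distances are processed in an arbitrary order in this sketch).
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))

    best = partition()              # finest level: n singletons
    best_size = len(best[M - 1])
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                # already in the same cluster
        parent[ri] = rj             # one merge = one level of the hierarchy
        clusters = partition()
        if len(clusters) < M:
            break                   # fewer than M clusters remain
        if len(clusters[M - 1]) > best_size:
            best_size, best = len(clusters[M - 1]), clusters

    # Points outside the M largest clusters are flagged as outliers (-1).
    labels = [-1] * n
    for k, cluster in enumerate(best[:M]):
        for i in cluster:
            labels[i] = k
    return labels
```

With two compact groups and a few isolated points, calling `robust_single_linkage(points, 2)` is expected to label the two groups 0 and 1 and the isolated points -1. Recomputing the partition after every merge keeps the sketch short, at the cost of quadratic time in the number of points.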
We also study the rate at which the clustering risk (1.6) tends to 0 under certain assumptions concerning:
- the separability and regularity of the supports;
- the sparsity of the model, that is, the ratio between the density of observations in the supports S_i, i = 1, [...]
The proposed approach is also compared with classical clustering methods (k-means, spectral clustering) on various simulation scenarios. Figure 1.5 shows the results of 4 approaches on the data of Figure 1.4. On this example, the classical single linkage clearly identifies two very small clusters, while all other observations are placed in a single cluster. The other approaches behave better, with a preference for spectral clustering and the approach we propose, which [...]
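The behaviour attributed to classical single linkage can be reproduced on a toy example. The sketch below is only an illustration: the data set `points` is hypothetical (two tight groups plus two isolated points) and is not the data of Figure 1.4, and the helper name `single_linkage_sizes` is an assumption.

```python
from itertools import combinations
from math import dist


def single_linkage_sizes(points, k):
    """Cut the single linkage hierarchy at k clusters; return sizes, largest first."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the closest pair of clusters until only k clusters remain.
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    n_clusters = n
    for i, j in edges:
        if n_clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            n_clusters -= 1

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Hypothetical toy data: two groups of 20 points and two isolated points.
group_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(4)]
group_b = [(x + 10, y + 10) for x, y in group_a]
points = group_a + group_b + [(5.0, -20.0), (-20.0, 5.0)]

print(single_linkage_sizes(points, 3))  # → [40, 1, 1]
```

Asking for 3 clusters yields one cluster of 40 points and two singletons: the isolated points take up cluster slots and the two real groups are merged, which is the failure mode described above.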