# Agglomerative 2-3 Hierarchical Classification: Theoretical and Applicative Study

1 AxIS - Usage-centered design, analysis and improvement of information systems
CRISAM - Inria Sophia Antipolis - Méditerranée , Inria Paris-Rocquencourt
Abstract : Classification is one of the many fields of Data Mining, which aims at extracting information from large data volumes by using computational techniques from machine learning, statistics and pattern recognition. One of the two common approaches to unsupervised classification (or clustering) is hierarchical clustering, whose purpose is to produce a tree in which the nodes represent clusters of the analyzed data. One of the main drawbacks of the best-known and most widely used hierarchical agglomerative method, the Agglomerative Hierarchical Classification (AHC), is that it cannot highlight groups of objects with characteristics from two or more classes, a property found for example in overlapping clusters.
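To make the limitation concrete, here is a minimal Python sketch of the classical AHC (not the 2-3 AHC studied in this thesis). The function name `naive_ahc`, the choice of single linkage, and the toy data are assumptions made for illustration only; the code is deliberately unoptimized.

```python
from itertools import combinations


def naive_ahc(points, dist):
    """Classical AHC with single linkage (illustrative, unoptimized).

    Returns the list of merges as (members_a, members_b, distance).
    """
    # Start with one singleton cluster per object.
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a | b)
        merges.append((sorted(a), sorted(b), d))
    return merges


# Toy one-dimensional data set (hypothetical values).
points = [0.0, 1.0, 5.0, 6.5]
merges = naive_ahc(points, lambda x, y: abs(x - y))
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Each merge joins exactly two clusters, and a merged cluster is never reused on its own, so every pair of clusters in the resulting tree is either nested or disjoint. This is why the classical method cannot represent overlapping clusters.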

This thesis deals with a recent extension of the Agglomerative Hierarchical Classification, the Agglomerative 2-3 Hierarchical Classification (2-3 AHC), proposed by P. Bertrand in 2002, with a focus on its application to Data Mining fields.
The three major contributions of this thesis are: the theoretical study of the 2-3 hierarchies (also called paired hierarchies), the new 2-3 AHC algorithm and its implementation, and the first applicative study of this method in two Data Mining fields.

Our theoretical study includes the discovery of four new theoretical properties of the 2-3 hierarchies and the definition of the aggregation links between clusters for this type of structure. This allowed us to highlight a special case of cluster merging and to introduce an intermediate step in the construction of the 2-3 hierarchies. The systematic and exhaustive study of the possible cases led us to formulate the best choices in terms of linkage and structure indexing, in order to improve the quality of the 2-3 hierarchies.

Next, based on our theoretical study and contributions, we proposed a new general Agglomerative 2-3 Hierarchical Classification algorithm. This is the result of our preceding study: a powerful algorithm exploring the multiple possibilities of the 2-3 hierarchical model. A theoretical complexity analysis of our 2-3 AHC algorithm showed a reduction in complexity from O(n³) for the initial algorithm to O(n² log n) for our algorithm. Tests on different datasets (real and generated) confirmed our theoretical complexity study. Very satisfying results were obtained by analyzing the "quality" of the 2-3 hierarchies compared with the traditional hierarchies: up to 50% more created clusters and a maximal gain of 84% using the Stress index.
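The abstract does not define the Stress index it uses. As a rough illustration only, the sketch below assumes a common normalized form (in the style of Kruskal's stress) that compares the original dissimilarities with those induced by a hierarchy. The function name `stress` and the toy matrices are hypothetical, and the thesis may use a different formula.

```python
import math


def stress(original, induced):
    """Normalized stress between two symmetric dissimilarity matrices.

    Assumed form: sqrt(sum (d_ij - u_ij)^2 / sum d_ij^2) over pairs i < j.
    """
    n = len(original)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    num = sum((original[i][j] - induced[i][j]) ** 2 for i, j in pairs)
    den = sum(original[i][j] ** 2 for i, j in pairs)
    return math.sqrt(num / den)


# Hypothetical 3-object example: original dissimilarities and the
# dissimilarities induced by a single-linkage hierarchy on them.
original = [[0, 1, 4],
            [1, 0, 5],
            [4, 5, 0]]
induced = [[0, 1, 4],
           [1, 0, 4],
           [4, 4, 0]]

value = stress(original, induced)
print(round(value, 4))
```

Under this assumed definition, a lower value means the hierarchy distorts the original dissimilarities less, and a value of 0 means they are reproduced exactly.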

We also proposed an object-oriented model of our algorithm, which was integrated in the "Hierarchical Clustering Toolbox" (HCT), a toolbox that we developed for the visualization of agglomerative hierarchical classification methods. We also integrated this model as a case indexing method in the Case Based Reasoning platform CBR*Tools, developed at INRIA Sophia Antipolis, and used it to design recommender systems.

Our last contribution is the first study of the applicability of the 2-3 AHC to real data from two Data Mining fields: Web Mining and XML Document Clustering. This study led to interesting results. It compared the 2-3 hierarchical clustering of INRIA's research teams, built from either users' behavior on the teams' Web sites or the teams' XML annual reports, with the existing organization of the research themes.

Finally, we show that this subject is far from being exhausted and we propose several research perspectives related to the Agglomerative 2-3 Hierarchical Classification and to our HCT toolbox, developed during this thesis.
Document type : Theses

Cited literature [115 references]

https://tel.archives-ouvertes.fr/tel-00156809
Contributor : Sergiu Chelcea
Submitted on : Friday, June 22, 2007 - 4:30:52 PM
Last modification on : Friday, May 25, 2018 - 12:02:04 PM
Long-term archiving on : Thursday, April 8, 2010 - 5:51:21 PM

### Identifiers

• HAL Id : tel-00156809, version 1

### Citation

Sergiu Chelcea. Agglomerative 2-3 Hierarchical Classification: Theoretical and Applicative Study. Human-Computer Interaction [cs.HC]. Université Nice Sophia Antipolis, 2007. English. ⟨tel-00156809⟩
