Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance (Unsupervised learning from massive data streams: application to insurance Big Data)

Abstract : This thesis develops approaches based on growing neural gas (GNG) for clustering data streams. We propose three algorithmic extensions of GNG (sequential; distributed and parallel; and hierarchical), as well as a scalable model based on MapReduce and its application to learning clusters from real insurance Big Data arriving as a data stream.

We first propose G-Stream, a "sequential" clustering method. It is a one-pass data stream clustering algorithm that discovers clusters of arbitrary shape without any assumption on the number of clusters. G-Stream uses an exponential fading function to reduce the impact of old data, whose relevance diminishes over time. The links between nodes are also weighted. A reservoir temporarily holds distant observations, which reduces the movement of the nodes nearest to those observations.

The batchStream algorithm is a micro-batch method for clustering data streams. It defines a new cost function that takes into account that subsets of observations arrive in discrete batches. Minimizing this function, which leads to a topological clustering, is carried out using dynamic clusters in two steps: an assignment step, which assigns each observation to a cluster, followed by an optimization step, which computes the prototype for each node. A scalable model using MapReduce is then proposed. It decomposes the data stream clustering problem into the elementary functions Map and Reduce. The observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations (Map and Reduce) to produce the intermediate states or the final clusters. The batchStream algorithm is validated on insurance Big Data. A prediction and analysis system is proposed by combining the clustering results of batchStream with decision trees. The architecture and these different modules form the computational core of our Big Data project, called Square Predict.

Our third extension, GH-Stream, addresses both visualization and clustering. It uses a hierarchical and topological structure for both tasks.
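The two-step scheme described in the abstract (assignment, then prototype optimization, expressed as Map and Reduce over one micro-batch) can be illustrated with a small sketch. This is not the thesis's batchStream implementation: the function names, the form of the fading function `exp(-decay * age)`, the `decay` parameter, and the weighted-mean update are all assumptions made for illustration, and the sketch leaves out the topological neighborhood term, the weighted links between nodes, and the reservoir.

```python
import math
from collections import defaultdict


def fading_weight(age, decay=0.5):
    # Exponential fading: the weight shrinks as data gets older.
    # The exact form and the decay rate are assumptions of this sketch.
    return math.exp(-decay * age)


def nearest_node(prototypes, x):
    # Index of the prototype closest to x (squared Euclidean distance).
    return min(
        range(len(prototypes)),
        key=lambda k: sum((p - v) ** 2 for p, v in zip(prototypes[k], x)),
    )


def map_assign(prototypes, batch):
    # Map (assignment step): emit one (node index, observation) pair
    # per observation in the micro-batch.
    return [(nearest_node(prototypes, x), x) for x in batch]


def reduce_update(prototypes, weights, pairs, decay=0.5):
    # Reduce (optimization step): group observations by node, then blend
    # each node's faded history with the observations assigned to it.
    groups = defaultdict(list)
    for k, x in pairs:
        groups[k].append(x)

    fade = fading_weight(1, decay)  # one batch interval has elapsed
    new_prototypes, new_weights = [], []
    for k, proto in enumerate(prototypes):
        old = weights[k] * fade          # faded weight of past data
        members = groups.get(k, [])
        total = old + len(members)
        if total == 0:
            # Node has no history and no new data: leave it unchanged.
            new_prototypes.append(tuple(proto))
        else:
            # Weighted mean of the old prototype and the new observations.
            new_prototypes.append(tuple(
                (old * p + sum(x[d] for x in members)) / total
                for d, p in enumerate(proto)
            ))
        new_weights.append(total)
    return new_prototypes, new_weights
```

To process a stream, one would call `map_assign` and then `reduce_update` once per incoming batch, carrying the returned prototypes and weights forward as the intermediate state. Because the Map step treats each observation independently and the Reduce step treats each node independently, both can in principle be distributed across workers.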
Document type :
Theses

Cited literature : 134 references

https://tel.archives-ouvertes.fr/tel-02152373
Contributor : Abes Star
Submitted on : Tuesday, June 11, 2019 - 1:44:29 PM
Last modification on : Tuesday, September 17, 2019 - 10:39:06 AM

File

edgalilee_th_2016_ghesmoune.pd...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02152373, version 1

Citation

Mohammed Ghesmoune. Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance. Environnements Informatiques pour l'Apprentissage Humain. Université Sorbonne Paris Cité, 2016. Français. ⟨NNT : 2016USPCD061⟩. ⟨tel-02152373⟩
