Theses

Massively distributed learning in a Big Data environment

Abstract : In recent years, the amount of data analysed by companies and research laboratories has increased sharply, opening the era of Big Data. However, these raw data are frequently uncategorized and difficult to use. This thesis aims to improve and ease the preprocessing and understanding of these large amounts of data by using unsupervised machine learning algorithms.

The first part of this thesis is dedicated to a state of the art of clustering and biclustering algorithms and to an introduction to big data technologies. The first contribution introduces the design of a Self-Organizing Map clustering algorithm [Kohonen, 2001] for a big data environment. Our algorithm (SOM-MR) provides the same advantages as the original algorithm, namely the creation of a data visualisation map based on data clusters. Moreover, it runs on the Spark platform, which enables it to process a large amount of data in a short time. Thanks to the popularity of this platform, it fits easily into many data mining environments. We demonstrated this in our project "Square Predict", carried out in partnership with Axa insurance. The aim of this project was to provide a real-time data analysis platform in order to estimate the severity of natural disasters or improve knowledge of residential risks. Throughout this project, we demonstrated the efficiency of our algorithm through its capacity to analyse and create visualisations from a large volume of data coming from social networks and open data.

The second part of this work is dedicated to a new biclustering algorithm. Biclustering consists in clustering observations and variables at the same time. In this contribution we put forward a new biclustering approach, based on the self-organizing map algorithm, that can scale to large amounts of data (BiTM-MR). To reach this goal, this algorithm is also based on the Spark platform. It brings out more information than the SOM-MR algorithm because, besides producing groups of observations, it also associates variables with these groups, thus creating biclusters of variables and observations.
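
To make the idea of a self-organizing map expressed as map and reduce steps more concrete, here is a minimal sketch of one batch SOM iteration. It is not the thesis's SOM-MR implementation: it uses plain Python in place of Spark, a Gaussian neighbourhood is assumed, and all names (`bmu`, `map_partition`, `merge`, `som_iteration`) are hypothetical. Each partition computes partial sums independently (the map step), the partial sums are added together (the reduce step), and the prototypes are then recomputed from the totals.

```python
import math
from functools import reduce


def bmu(x, weights):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, weights[k])),
    )


def map_partition(rows, weights, grid, sigma):
    """Map step: partial numerators and denominators for one data partition."""
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]
    den = [0.0] * len(weights)
    for x in rows:
        c = bmu(x, weights)
        for k, pos in enumerate(grid):
            # Gaussian neighbourhood between neuron k and the winner c,
            # measured on the map grid (assumed form, not from the source).
            d2 = sum((p - q) ** 2 for p, q in zip(pos, grid[c]))
            h = math.exp(-d2 / (2.0 * sigma ** 2))
            den[k] += h
            for j in range(dim):
                num[k][j] += h * x[j]
    return num, den


def merge(a, b):
    """Reduce step: element-wise sum of two partial results."""
    num = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [u + v for u, v in zip(a[1], b[1])]
    return num, den


def som_iteration(partitions, weights, grid, sigma):
    """One batch update: map over partitions, reduce, recompute prototypes."""
    partials = (map_partition(p, weights, grid, sigma) for p in partitions)
    num, den = reduce(merge, partials)
    # Keep the old prototype if a neuron received no neighbourhood weight.
    return [
        [v / d for v in row] if d > 0 else list(w)
        for row, d, w in zip(num, den, weights)
    ]
```

Because the reduce step is a plain sum, the updated prototypes do not depend on how the observations are split across partitions, which is the property that lets this kind of update be distributed. For example, `som_iteration([rows_a, rows_b], weights, grid, sigma=0.5)` with `grid = [(0, 0), (0, 1)]` updates a two-neuron map from two partitions.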
Keywords : Bi-Clustering

Cited literature [101 references]

https://tel.archives-ouvertes.fr/tel-02500012
Contributor : Abes Star
Submitted on : Thursday, March 5, 2020 - 4:42:15 PM
Last modification on : Friday, March 6, 2020 - 2:16:48 AM
Long-term archiving on : Saturday, June 6, 2020 - 4:32:13 PM

File

edgalilee_th_2018_sarazin1.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02500012, version 1

Citation

Tugdual Sarazin. Massively distributed learning in a Big Data environment. Databases [cs.DB]. Université Sorbonne Paris Cité, 2018. English. ⟨NNT : 2018USPCD050⟩. ⟨tel-02500012⟩
