Parallel Itemset Mining in Massively Distributed Environments

Mehdi Zitouni 1, 2, 3
2 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : This thesis first tackles the problem of mining closed frequent itemsets (CFIs) in big datasets. We adopt a prime-number-based approach to improve the performance of the parallel CFI mining process and introduce Distributed-Closed-Itemset-Mining (DCIM), a parallel algorithm for mining CFIs from large amounts of data. DCIM discovers itemsets with improved efficiency and a more compact result. A key feature of DCIM is its combination of data mining properties with the principles of massive data distribution. Extensive experiments on real-world datasets with up to 53 million documents illustrate the efficiency of DCIM.

The second problem we address is the discovery of maximally informative k-itemsets (miki), based on joint entropy, from high-volume incoming and outgoing data streams. We propose Parallel entropy computing over Streams (PentroS), a highly scalable parallel miki mining algorithm that keeps the mining of high-throughput data succinct and effective in a streaming setting. Its mining process consists of only two efficient parallel jobs. With PentroS, we provide a set of significant optimizations for computing the joint entropy of miki of different sizes, which drastically reduces the latency of the mining process. PentroS is extensively evaluated on a massive real-world data stream. Our experimental results confirm the effectiveness of our proposal through the significant scale-up obtained with long itemsets and very high data throughput.

Finally, we address the problem of parallel classification in highly distributed environments. We propose Ensemble of Ensembles of Classifiers (EEC), a parallel, scalable and highly accurate classification algorithm. EEC makes a classification task simple yet very efficient: its working process is composed of two simple and compact jobs. By calling on more than one classifier, EEC exploits parallelism not only to reduce execution time but also to significantly improve classification accuracy through two-level decision making. We show that the classification accuracy of EEC is improved by using informative patterns and that the classification error can be bounded by a small value. EEC is extensively evaluated on various large real-world datasets. Our experimental results suggest that EEC is significantly more efficient and more accurate than alternative approaches.
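
To make the notion of a maximally informative k-itemset concrete, the following is a minimal, single-machine sketch of the joint entropy measure the abstract refers to. It is not PentroS and reproduces none of its parallel jobs or optimizations; the function names `joint_entropy` and `brute_force_miki` and the exhaustive search are assumptions made purely for illustration. Each item is treated as a binary presence/absence variable, and the miki of size k is the k-itemset whose joint presence pattern has the highest entropy.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence pattern of `itemset`.

    `transactions` is a list of sets of items; each item in `itemset`
    is viewed as a binary variable (present or absent in a transaction).
    """
    items = sorted(itemset)
    # Count how often each presence/absence combination occurs.
    patterns = Counter(
        tuple(item in transaction for item in items)
        for transaction in transactions
    )
    total = len(transactions)
    return -sum((c / total) * log2(c / total) for c in patterns.values())


def brute_force_miki(transactions, k):
    """Return the k-itemset with maximal joint entropy by exhaustive search.

    Illustrative only: the cost grows combinatorially with the number of
    items, which is exactly what a scalable algorithm must avoid.
    """
    universe = sorted(set().union(*transactions))
    return max(
        combinations(universe, k),
        key=lambda candidate: joint_entropy(transactions, candidate),
    )
```

On a toy dataset such as `[{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}]`, the item `c` appears in every transaction and therefore carries no information, so the sketch prefers the pair `("a", "b")` over any pair containing `c`.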

https://tel.archives-ouvertes.fr/tel-01953619
Contributor : Mehdi Zitouni
Submitted on : Friday, December 14, 2018 - 9:54:29 AM
Last modification on : Friday, May 17, 2019 - 11:39:42 AM
Long-term archiving on: Friday, March 15, 2019 - 1:13:44 PM

File

MehdiZitouni_thesis.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-01953619, version 2

Citation

Mehdi Zitouni. Parallel Itemset Mining in Massively Distributed Environments. Information Theory [cs.IT]. Université de Tunis El Manar; Inria, 2018. English. ⟨tel-01953619v2⟩
