Skip to Main content Skip to Navigation
Theses

Mining and ranking closed itemsets from large-scale transactional datasets

Abstract : The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of items sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands items. These datasets usually follow a "long tail" distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (the most frequents, usually). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, and less than a hundred associations at once.To address the items' coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI computes efficiently interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed-up computation on large scale datasets.Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules' quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain.Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets.Altogether our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of items' associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.
Complete list of metadatas

Cited literature [73 references]  Display  Hide  Download

https://tel.archives-ouvertes.fr/tel-01699907
Contributor : Abes Star :  Contact
Submitted on : Friday, February 2, 2018 - 7:42:06 PM
Last modification on : Wednesday, October 14, 2020 - 4:03:45 AM
Long-term archiving on: : Wednesday, May 2, 2018 - 10:42:24 PM

File

KIRCHGESSNER_2016_diffusion.pd...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01699907, version 1

Collections

Citation

Martin Kirchgessner. Mining and ranking closed itemsets from large-scale transactional datasets. Data Structures and Algorithms [cs.DS]. Université Grenoble Alpes, 2016. English. ⟨NNT : 2016GREAM060⟩. ⟨tel-01699907⟩

Share

Metrics

Record views

493

Files downloads

1003