CURARE : curating and managing big data collections on the cloud

Gavin Kemp 1, 2
2 SOC - Service Oriented Computing
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : The emergence of new platforms for decentralized data creation, such as sensor and mobile platforms, together with the increasing availability of open data on the Web, is increasing the number of data sources inside organizations and yields an unprecedented volume of Big Data to be explored. The notion of data curation has emerged to refer to the maintenance of data collections and to the preparation and integration of datasets, combining them to perform analytics. Curation tasks include extracting explicit and implicit metadata, and semantic metadata matching and enrichment to add quality to the data. Next-generation data management engines should promote techniques with a new philosophy to cope with the deluge of data. They should help the user understand the content of data collections and provide guidance for exploring data. A scientist can explore data collections stepwise and stop when the content and quality reach a satisfaction point. Our work adopts this philosophy, and its main contribution is a data collection curation approach and exploration environment named CURARE. CURARE is a service-based system for curating and exploring Big Data. It implements a data collection model that we propose for representing the content of data collections in terms of structural and statistical metadata organised under the concept of view. A view is a data structure that provides an aggregated perspective of the content of a data collection and its several associated releases. CURARE provides tools for computing and extracting views using data analytics methods, as well as functions for exploring (querying) metadata. Exploiting Big Data requires data analysts to make a substantial number of decisions to determine the best way to store, share and process data collections to get the maximum benefit and knowledge from them.
Rather than leaving data analysts to explore data collections manually, CURARE provides tools integrated in an environment that assists them in determining which collections are best suited to achieving an analytics objective. We implemented CURARE and explain how to deploy it on the cloud using data science services on top of which CURARE services are plugged. We conducted experiments to measure the cost of computing views based on datasets of Grand Lyon and Twitter, to provide insight into the interest of our data curation approach and environment.
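To make the notion of a view more concrete, the following minimal Python sketch shows one way structural metadata (attribute names and value types) and statistical metadata (counts, missing values, minimum, maximum, mean) could be computed for a single release of a data collection. It is an illustration only, not the CURARE implementation: the names `AttributeSummary`, `ReleaseView` and `compute_release_view`, the choice of statistics, and the sample records are all assumptions introduced here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Dict, List, Optional, Set


@dataclass
class AttributeSummary:
    """Hypothetical per-attribute metadata for one release."""
    name: str
    types: Set[str]                  # structural: observed value types
    count: int                       # statistical: non-missing values
    missing: int                     # statistical: missing values
    minimum: Optional[float] = None  # numeric attributes only
    maximum: Optional[float] = None
    average: Optional[float] = None


@dataclass
class ReleaseView:
    """Hypothetical aggregated perspective of one release."""
    release_id: str
    record_count: int
    attributes: Dict[str, AttributeSummary]


def compute_release_view(release_id: str,
                         records: List[Dict[str, Any]]) -> ReleaseView:
    """Summarise a list of records (dicts) into a ReleaseView."""
    names = sorted({key for record in records for key in record})
    attributes: Dict[str, AttributeSummary] = {}
    for name in names:
        values = [r[name] for r in records if r.get(name) is not None]
        # bool is a subclass of int, so exclude it from numeric statistics
        numeric = [v for v in values
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        attributes[name] = AttributeSummary(
            name=name,
            types={type(v).__name__ for v in values},
            count=len(values),
            missing=len(records) - len(values),
            minimum=min(numeric) if numeric else None,
            maximum=max(numeric) if numeric else None,
            average=mean(numeric) if numeric else None,
        )
    return ReleaseView(release_id, len(records), attributes)


# Made-up sample records, for illustration only
sample = [
    {"station": "A", "temp": 10.0},
    {"station": "B", "temp": 20.0},
    {"station": "C"},
]
view = compute_release_view("release-1", sample)
```

Under these assumptions, an analyst could query such summaries (for example, the share of missing values per attribute) to judge whether a collection suits an analytics objective before processing the data itself.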
Document type :
Theses
Cited literature [104 references]

https://tel.archives-ouvertes.fr/tel-02058604
Contributor : Abes Star
Submitted on : Wednesday, March 6, 2019 - 8:33:07 AM
Last modification on : Friday, May 17, 2019 - 10:32:40 AM
Long-term archiving on : Friday, June 7, 2019 - 2:35:44 PM

File

TH2018KEMPGAVIN.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02058604, version 1

Citation

Gavin Kemp. CURARE : curating and managing big data collections on the cloud. Databases [cs.DB]. Université de Lyon, 2018. English. ⟨NNT : 2018LYSE1179⟩. ⟨tel-02058604⟩
