Active Data - Enabling Smart Data Life Cycle Management for Large Distributed Scientific Data Sets

Anthony Simonet
Abstract : In all domains, scientific progress relies more and more on our ability to exploit ever-growing volumes of data. However, as data volumes increase, their management becomes more difficult. A key point is to deal with the complexity of data life cycle management, i.e. all the operations that happen to data between their creation and their deletion: transfer, archiving, replication, disposal, etc. These formerly straightforward operations become intractable when data volume grows dramatically, because of the heterogeneity of data management software on the one hand, and the complexity of the infrastructures involved on the other. In this thesis, we introduce Active Data, a meta-model, an implementation and a programming model that make it possible to represent, formally and graphically, the life cycle of data distributed across an assemblage of heterogeneous systems and infrastructures, naturally exposing replication, distribution and different data identifiers. Once connected to existing applications, Active Data exposes the progress of data through their life cycle at runtime to users and programs, while keeping track of the data as they pass from one system to another. The Active Data programming model allows code to be executed at each step of the data life cycle. Programs developed with Active Data have access at any time to the complete state of the data in every system and infrastructure across which they are distributed. We present micro-benchmarks and usage scenarios that demonstrate the expressivity of the programming model and the quality of the implementation.
Finally, we describe the implementation of a Data Surveillance framework based on Active Data for the Advanced Photon Source experiment that allows scientists to monitor the progress of their data, automate most manual tasks, get relevant notifications from a huge number of events, and detect and recover from errors without human intervention. This work provides interesting perspectives in data provenance and open data in particular, while facilitating collaboration between scientists from different communities.

Cited literature [142 references]
Contributor : ABES STAR
Submitted on : Tuesday, October 20, 2015 - 2:32:27 PM
Last modification on : Friday, September 30, 2022 - 4:12:10 AM
Long-term archiving on : Friday, April 28, 2017 - 6:46:43 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01218016, version 1


Anthony Simonet. Active Data - Enabling Smart Data Life Cycle Management for Large Distributed Scientific Data Sets. Distributed, Parallel, and Cluster Computing [cs.DC]. Ecole normale supérieure de lyon - ENS LYON, 2015. English. ⟨NNT : 2015ENSL1004⟩. ⟨tel-01218016⟩


