Designing scientific workflows following a structure and provenance-aware strategy

Jiuqiang Chen 1, 2
2 AMIB - Algorithms and Models for Integrative Biology
LIX - Laboratoire d'informatique de l'École polytechnique [Palaiseau], LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France
Abstract: Scientific workflow systems are equipped with provenance modules that collect the data produced and consumed during workflow runs in order to enhance reproducibility. For several reasons, the complexity of workflow structures and workflow execution structures is increasing over time, with a clear impact on the reuse of scientific workflows. The global aim of this thesis is to enhance workflow reuse by providing strategies that reduce the complexity of workflow structures while preserving provenance. Two strategies are introduced. First, we propose an approach to rewrite any scientific workflow (represented as a directed acyclic graph (DAG)) into a series-parallel (SP) structure while preserving provenance. Such structures make it possible to design polynomial-time algorithms for complex workflow operations (e.g., comparing workflows), whereas such operations are related to an NP-hard problem for general DAG structures. To this end, we introduce SPFlow, a provenance-preserving rewriting algorithm. Second, we provide a methodology and a technique to reduce the redundancy present in workflows by detecting and removing the "anti-patterns" responsible for such redundancy. The DistillFlow algorithm transforms a workflow into a distilled, semantically equivalent workflow that is free or partly free of anti-patterns and has a simpler, more concise structure. Both approaches (SPFlow and DistillFlow) are based on a provenance model that we have introduced to represent the provenance structure of workflow executions. Our solutions are available for use at https://www.lri.fr/~chenj. They have been systematically tested on large collections of real workflows, especially from the Taverna system.
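To make the notion of a series-parallel structure concrete, here is a minimal sketch of the classical reduction-based test for whether a two-terminal DAG is series-parallel. It is not the SPFlow algorithm from the thesis, which rewrites non-SP workflows; it only checks the property. The function name `is_series_parallel` and the edge-list representation are assumptions made for illustration.

```python
def is_series_parallel(edges, source, sink):
    """Return True if the two-terminal DAG reduces to the single edge (source, sink).

    Illustrative sketch only (hypothetical helper, not taken from the thesis).
    `edges` is an iterable of (u, v) pairs describing a DAG.
    """
    # Storing edges in a set performs the parallel reduction implicitly:
    # duplicate u -> v edges collapse into one.
    edge_set = set(edges)
    changed = True
    while changed:
        changed = False
        nodes = {n for e in edge_set for n in e}
        for v in nodes - {source, sink}:
            preds = [u for (u, w) in edge_set if w == v]
            succs = [w for (u, w) in edge_set if u == v]
            # Series reduction: an inner node with exactly one predecessor and
            # one successor is bypassed, replacing u -> v -> w with u -> w.
            if len(preds) == 1 and len(succs) == 1:
                u, w = preds[0], succs[0]
                edge_set -= {(u, v), (v, w)}
                edge_set.add((u, w))
                changed = True
                break
    return edge_set == {(source, sink)}


# A diamond-shaped workflow (two parallel branches) is series-parallel.
diamond = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
# Adding a cross edge a -> b yields the classic non-SP "bridge" shape.
bridge = diamond + [("a", "b")]
```

In this sketch the diamond reduces to a single edge, while the bridge graph admits no reduction at all, which is the kind of structure a rewriting step would have to transform.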

Cited literature: 87 references

https://tel.archives-ouvertes.fr/tel-00931122
Contributor: Sarah Cohen-Boulakia
Submitted on: Tuesday, January 14, 2014 - 10:37:14 PM
Last modification on: Monday, December 9, 2019 - 5:24:07 PM
Long-term archiving on: Tuesday, April 15, 2014 - 4:28:54 PM

Identifiers

  • HAL Id: tel-00931122, version 1

Citation

Jiuqiang Chen. Designing scientific workflows following a structure and provenance-aware strategy. Databases [cs.DB]. Université Paris Sud - Paris XI, 2013. English. ⟨tel-00931122⟩
