Designing scientific workflow following a structure and provenance-aware strategy

Abstract : Bioinformatics experiments are usually performed using scientific workflows in which tasks are chained together, forming intricate and nested graph structures. Scientific workflow systems have been developed to guide users in the design and execution of workflows. An advantage of these systems over traditional approaches is their ability to automatically record the provenance (or lineage) of intermediate and final data products generated during workflow execution. The provenance of a data product contains information about how the product was derived, and it is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. For several reasons, the complexity of workflow and workflow execution structures is increasing over time, which has a clear impact on scientific workflow reuse.

The global aim of this thesis is to enhance workflow reuse by providing strategies to reduce the complexity of workflow structures while preserving provenance. Two strategies are introduced.

First, we propose an approach to rewrite the graph structure of any scientific workflow (classically represented as a directed acyclic graph (DAG)) into a simpler structure, namely a series-parallel (SP) structure, while preserving provenance. SP-graphs are simple and layered, making the main phases of a workflow easier to distinguish. Additionally, from a more formal point of view, polynomial-time algorithms for performing complex graph-based operations (e.g., comparing workflows, which is directly related to the problem of subgraph homomorphism) can be designed when workflows have SP-structures, whereas such operations are related to an NP-hard problem for DAGs with no restriction on their structure. The SPFlow rewriting and provenance-preserving algorithm and its associated tool are thus introduced.

Second, we provide a methodology together with a technique able to reduce the redundancy present in workflows (by removing unnecessary occurrences of tasks). More precisely, we detect "anti-patterns", a term broadly used in program design to indicate the use of idiomatic forms that lead to over-complicated design, and which should therefore be avoided. We thus provide the DistillFlow algorithm, which transforms a workflow into a distilled, semantically equivalent workflow that is free or partly free of anti-patterns and has a more concise and simpler structure.

The two main approaches of this thesis (namely, SPFlow and DistillFlow) are based on a provenance model that we have introduced to represent the provenance structure of workflow executions. The notion of provenance-equivalence, which determines whether two workflows have the same meaning, is also at the center of our work. Our solutions have been systematically tested on large collections of real workflows, especially from the Taverna system. Our approaches are available for use at https://www.lri.fr/~chenj/.
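To make the distinction between an arbitrary DAG and an SP structure concrete, here is a minimal Python sketch of the classic reduction test for two-terminal series-parallel graphs: repeatedly merge parallel edges and contract intermediate nodes that have exactly one incoming and one outgoing edge, then check whether a single source-to-sink edge remains. This is the textbook edge-based definition and is only an illustration. It is not the SPFlow algorithm, the thesis may use a different SP definition, and the function name `is_series_parallel` and the node labels are hypothetical.

```python
from collections import Counter


def is_series_parallel(edges, source, sink):
    """Return True if the DAG reduces to a single source->sink edge.

    `edges` is a list of (u, v) pairs; duplicates denote parallel edges.
    Assumes the input is acyclic with one source and one sink.
    """
    edge_count = Counter(edges)  # multigraph: edge -> multiplicity
    changed = True
    while changed:
        changed = False
        # Parallel reduction: collapse duplicate edges into one.
        for edge in list(edge_count):
            if edge_count[edge] > 1:
                edge_count[edge] = 1
                changed = True
        # Series reduction: contract one node with in-degree 1 and out-degree 1.
        indeg, outdeg = Counter(), Counter()
        for (u, v), count in edge_count.items():
            outdeg[u] += count
            indeg[v] += count
        for node in set(indeg) | set(outdeg):
            if node in (source, sink):
                continue
            if indeg[node] == 1 and outdeg[node] == 1:
                (u,) = [a for (a, b) in edge_count if b == node]
                (w,) = [b for (a, b) in edge_count if a == node]
                del edge_count[(u, node)]
                del edge_count[(node, w)]
                edge_count[(u, w)] += 1
                changed = True
                break  # degrees are stale now; recompute on the next pass
    return dict(edge_count) == {(source, sink): 1}


# A diamond (two independent branches) is series-parallel.
diamond = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
# Adding a cross edge a->b yields the smallest non-SP pattern.
bridge = diamond + [("a", "b")]

print(is_series_parallel(diamond, "s", "t"))  # True
print(is_series_parallel(bridge, "s", "t"))   # False
```

The second example shows the kind of cross-branch dependency that a rewriting into SP form would have to eliminate; how to do so while preserving provenance is the subject of the thesis and is not attempted here.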
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-01074024
Contributor : Abes Star
Submitted on : Sunday, October 12, 2014 - 1:04:48 AM
Last modification on : Tuesday, April 24, 2018 - 1:38:15 PM
Archived on : Tuesday, January 13, 2015 - 10:11:02 AM

Identifiers

  • HAL Id : tel-01074024, version 1

Citation

Jiuqiang Chen. Designing scientific workflow following a structure and provenance-aware strategy. Other [cs.OH]. Université Paris Sud - Paris XI, 2013. English. ⟨NNT : 2013PA112221⟩. ⟨tel-01074024⟩
