
A science-gateway for workflow executions : Online and non-clairvoyant self-healing of workflow executions on grids

Rafael Ferreira da Silva 1
1 Images et Modèles
CREATIS - Centre de Recherche en Acquisition et Traitement de l'Image pour la Santé
Abstract : Science gateways, such as the Virtual Imaging Platform (VIP), enable transparent access to distributed computing and storage resources for scientific computations. However, their large scale and the number of middleware systems involved in these gateways lead to many errors and faults. In practice, science gateways are often backed by substantial support staff who monitor running experiments by performing simple yet crucial actions such as rescheduling tasks, restarting services, killing misbehaving runs or replicating data files to reliable storage facilities. Fair quality of service (QoS) can then be delivered, yet only with significant human intervention. Automating such operations is challenging for two reasons. First, the problem is online by nature because no reliable user activity prediction can be assumed, and new workloads may arrive at any time. Therefore, the considered metrics, decisions and actions have to remain simple and to yield results while the application is still executing. Second, it is non-clairvoyant due to the lack of information about applications and resources in production conditions. Computing resources are usually dynamically provisioned from heterogeneous clusters, clouds or desktop grids without any reliable estimate of their availability and characteristics. Models of application execution times are rarely available either, in particular on heterogeneous computing resources. In this thesis, we propose a general self-healing process for autonomous detection and handling of operational incidents in workflow executions. Instances are modeled as Fuzzy Finite State Machines (FuSM) where state degrees of membership are determined by an external healing process. Degrees of membership are computed from metrics assuming that incidents have outlier performance, e.g. a site or a particular invocation behaves differently from the others.
Based on incident degrees, the healing process identifies incident levels using thresholds determined from the platform history. A specific set of actions is then selected from association rules among incident levels.
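The three steps named in the abstract (a degree of membership computed from an outlier metric, an incident level obtained from thresholds, and actions selected by rules) can be sketched roughly as follows. This is only an illustrative sketch, not the method of the thesis: the function names, the median-based deviation metric, the threshold values and the rule table are all assumptions made for the example. In particular, the thesis derives thresholds from the platform history and selects actions from association rules, whereas the lookup table here is a hand-written placeholder.

```python
from statistics import median

def incident_degree(peer_values, target):
    """Hypothetical metric: relative deviation of `target` from the median
    of its peers, clipped to [0, 1] so it can act as a degree of membership."""
    med = median(peer_values)
    if med == 0:
        return 0.0 if target == 0 else 1.0
    return min(1.0, abs(target - med) / med)

def incident_level(degree, thresholds):
    """Discrete level = number of thresholds reached by the degree.
    The thresholds are assumed values; the abstract says they come
    from the platform history."""
    return sum(degree >= t for t in sorted(thresholds))

# Placeholder rule table (assumption): (incident name, level) -> actions.
# It stands in for the association rules mentioned in the abstract.
RULES = {
    ("slow_site", 1): ["monitor_site"],
    ("slow_site", 2): ["replicate_tasks"],
}

def select_actions(levels, rules):
    """Collect the actions attached to each (incident, level) pair."""
    actions = []
    for name, level in levels.items():
        actions.extend(rules.get((name, level), []))
    return actions

# Example: one site is much slower than its peers (made-up durations, in seconds).
durations = [100, 110, 95, 105, 400]
degree = incident_degree(durations, 400)
level = incident_level(degree, thresholds=[0.3, 0.7])
print(select_actions({"slow_site": level}, RULES))
```

The point of the sketch is that every step uses only values observed during the execution itself, which matches the online and non-clairvoyant constraints stated in the abstract: no prediction of user activity and no model of execution times is needed.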
Contributor : Abes Star
Submitted on : Friday, March 6, 2015 - 1:32:40 AM
Last modification on : Friday, October 23, 2020 - 5:02:40 PM
Long-term archiving on : Sunday, June 7, 2015 - 10:25:18 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01124002, version 1


Rafael Ferreira da Silva. A science-gateway for workflow executions : Online and non-clairvoyant self-healing of workflow executions on grids. Computer Aided Engineering. INSA de Lyon, 2013. English. ⟨NNT : 2013ISAL0115⟩. ⟨tel-01124002⟩


