
Fault-tolerant and energy-aware algorithms for workflows and real-time systems

Li Han
Abstract : This thesis focuses on two major problems in the high-performance computing context: resilience and energy consumption.

To satisfy the computing power required by modern scientific research, the number of computing units in supercomputers has increased dramatically in recent years. This leads to more frequent errors than expected. Failure handling is therefore critical for highly parallel applications that use a large number of components for a significant amount of time; otherwise, one may spend infinite time re-executing. On the other hand, power management is necessary due to both monetary and environmental constraints, especially because resilience often calls for redundancy in time and/or in space, which in turn consumes extra energy. In addition, technologies that reduce energy consumption often have negative effects on performance and resilience.

In this context, we re-design scheduling algorithms to investigate trade-offs between performance, resilience and energy consumption. The first part focuses on task graph scheduling and fail-stop errors: which tasks should be checkpointed (redundancy in time) in order to minimize the total execution time? The objective is to design optimal solutions for special classes of task graphs, and to provide general-purpose heuristics for arbitrary ones. In the second part of the thesis, we consider periodic independent task sets, which is the context of real-time scheduling, and silent errors. We investigate the number of replicas (redundancy in space) that are needed, and the interplay between deadlines, energy minimization and reliability.

Cited literature : 132 references
Contributor : ABES STAR
Submitted on : Monday, June 1, 2020 - 6:38:07 PM
Last modification on : Monday, May 16, 2022 - 4:46:02 PM
Long-term archiving on : Friday, September 25, 2020 - 6:57:36 AM


Version validated by the jury (STAR)


  • HAL Id : tel-02713064, version 1


Li Han. Fault-tolerant and energy-aware algorithms for workflows and real-time systems. Distributed, Parallel, and Cluster Computing [cs.DC]. Université de Lyon; East China Normal University (Shanghai), 2020. English. ⟨NNT : 2020LYSEN013⟩. ⟨tel-02713064⟩


