Theses

Fault-tolerant and energy-aware algorithms for workflows and real-time systems

Abstract : This thesis focuses on two major problems in high-performance computing: resilience and energy consumption. To satisfy the computing power required by modern scientific research, the number of computing units in supercomputers has increased dramatically in recent years. This leads to more frequent errors than expected. Failure handling is therefore critical for highly parallel applications that use a large number of components for a significant amount of time; without it, an application may spend unbounded time re-executing. On the other hand, power management is necessary because of both monetary and environmental constraints, especially because resilience often calls for redundancy in time and/or in space, which in turn consumes extra energy. In addition, technologies that reduce energy consumption often have negative effects on performance and resilience. In this context, we re-design scheduling algorithms to investigate trade-offs between performance, resilience and energy consumption. The first part focuses on task graph scheduling and fail-stop errors. Which tasks should be checkpointed (redundancy in time) in order to minimize the total execution time? The objective is to design optimal solutions for special classes of task graphs, and to provide general-purpose heuristics for arbitrary ones. In the second part of the thesis, we consider periodic independent task sets, which is the context of real-time scheduling, and silent errors. We investigate the number of replicas (redundancy in space) that are needed, and the interplay between deadlines, energy minimization and reliability.
Cited literature [132 references]

https://tel.archives-ouvertes.fr/tel-02713064
Contributor : Abes Star
Submitted on : Monday, June 1, 2020 - 6:38:07 PM
Last modification on : Wednesday, June 3, 2020 - 3:15:13 AM
Long-term archiving on : Friday, September 25, 2020 - 6:57:36 AM

File

HAN_Li_2020LYSEN013_These.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02713064, version 1

Citation

Li Han. Fault-tolerant and energy-aware algorithms for workflows and real-time systems. Distributed, Parallel, and Cluster Computing [cs.DC]. Université de Lyon; East China Normal University (Shanghai), 2020. English. ⟨NNT : 2020LYSEN013⟩. ⟨tel-02713064⟩

Metrics

Record views : 96
File downloads : 75