Theses

Resilient scheduling algorithms for large-scale platforms

Valentin Le Fevre 1, 2
Abstract : This thesis focuses on a major problem for the HPC community: resilience. Computing platforms keep growing in order to reach exascale, i.e. a computing capacity of 10^18 FLOP/s, but they suffer numerous failures. Reducing the execution time and handling errors are two linked problems: for instance, replication (computing redundancy) decreases the number of critical failures but also decreases the number of available resources. In particular, this thesis focuses on several “checkpoint/restart” mechanisms (saving the state of an application so that it can restart from that saved state when a failure occurs). The first part investigates checkpointing on several levels, the use of additional resources to cope with system latency, and checkpointing in generic task graphs. The second part deals with optimal checkpointing strategies when coupled with replication (in linear task graphs, on heterogeneous platforms, and with process duplication). The last part explores several scheduling problems linked to increasing disruptions in large-scale platforms.

Cited literature : 234 references

https://tel.archives-ouvertes.fr/tel-02947051
Contributor : Abes Star
Submitted on : Wednesday, September 23, 2020 - 4:07:02 PM
Last modification on : Wednesday, September 30, 2020 - 3:34:09 AM

File

LE_FEVRE_Valentin_2020LYSEN019...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02947051, version 1

Citation

Valentin Le Fevre. Resilient scheduling algorithms for large-scale platforms. Distributed, Parallel, and Cluster Computing [cs.DC]. Université de Lyon, 2020. English. ⟨NNT : 2020LYSEN019⟩. ⟨tel-02947051⟩

Metrics

  • Record views : 75
  • File downloads : 59