
Scheduling algorithms and resilience patterns for fail-stop and silent errors

Aurélien Cavelan
Abstract: This thesis focuses on resilience for high-performance applications that execute on large-scale platforms with millions of processing cores. On such platforms, errors are the norm rather than the exception. We consider two types of errors: fail-stop errors, which generally cause the application to stop, and silent errors, also known as Silent Data Corruptions (SDCs), which can corrupt data in memory. Silent errors pose a new threat to scientific applications because they are difficult both to detect and to correct. In this thesis, we first study several detection mechanisms for silent errors. We model the impact of such detectors on the execution of scientific applications, which allows us to decide which one to use when multiple choices are available. Then, we combine fail-stop errors and silent errors into a single resilience pattern: the application periodically verifies and checkpoints its results, so that after a failure it is not necessary to re-execute everything from scratch. The goal is to minimize either the execution time or the energy consumption. In this context, we extend several results from the literature by deriving the optimal resilience pattern for different types of applications. We also provide several exact scheduling algorithms that execute in polynomial time, as well as heuristics for application workflows. Finally, the models are validated through an exhaustive set of simulations and, when possible, by comparison against the state of the art.
Keywords: Models, Errors

Cited literature: 115 references

https://tel.archives-ouvertes.fr/tel-01582228
Contributor: Abes Star
Submitted on: Tuesday, September 5, 2017 - 6:52:15 PM
Last modification on: Wednesday, November 20, 2019 - 3:27:33 AM

File

CAVELAN_Aurelien_2017LYSEN031_...
Version validated by the jury (STAR)

Identifiers

  • HAL Id: tel-01582228, version 1

Citation

Aurélien Cavelan. Scheduling algorithms and resilience patterns for fail-stop and silent errors. Performance [cs.PF]. Université de Lyon, 2017. English. ⟨NNT : 2017LYSEN031⟩. ⟨tel-01582228⟩
