Resilient and energy-efficient scheduling algorithms at scale

Guillaume Aupy
Abstract: This thesis deals with two issues for future Exascale platforms, namely resilience and energy, which we address in two main parts. The first part focuses on the importance of reliability for future Exascale platforms, while the second part discusses how to improve the energy consumption of these platforms.

Considering the rate at which the reliability of individual components improves on one side, and the rate at which the number of components grows on the other side, the reliability of the entire platform is expected to decrease, due to probabilistic amplification. The mean time between failures (MTBF) of an Exascale system is even expected to be shorter than the time needed to take a system checkpoint. In the first part of this thesis, we therefore focus on the optimal placement of periodic coordinated checkpoints to minimize execution time. In a general context, we start by improving the first-order period formula obtained by Young and Daly (a sketch of this classical reasoning is given below). We then consider fault predictors, software used by system administrators that attempts to predict, through the study of past events, where and when faults will strike. These predictors are imperfect: they only predict a fraction of the failures, called their "recall", and among their predictions, only a percentage, called their "precision", corresponds to actual faults. Finally, we also consider fault predictors that do not give the exact moment when a fault will strike, but instead provide an interval of time during which the fault is likely to occur. In these contexts, we propose efficient algorithms and give a first-order optimal formula for the amount of work that should be done between two checkpoints.

We then revisit traditional checkpointing and rollback-recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay following a probability distribution (typically, an exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period that minimizes the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may then lead to an irrecoverable failure; in this case, we compute the minimum period required to keep the risk acceptable. Under the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
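For reference, here is the classical first-order reasoning behind the Young/Daly period mentioned above; this is only a sketch of the standard result that the thesis refines. Writing $\mu$ for the platform MTBF, $C$ for the checkpoint cost, and $T$ for the period, the waste is, at first order, the checkpointing overhead plus the expected re-execution after a failure (half a period lost, once every $\mu$ time units on average):

    \mathrm{Waste}(T) \;\approx\; \frac{C}{T} + \frac{T}{2\mu},
    \qquad
    \frac{d\,\mathrm{Waste}}{dT} = -\frac{C}{T^{2}} + \frac{1}{2\mu} = 0
    \quad\Longrightarrow\quad
    T_{\mathrm{opt}} = \sqrt{2\mu C}.

Minimizing the waste thus yields the well-known square-root period; the thesis sharpens this formula and extends the reasoning to fault predictors and to silent errors.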
While reliability is a major concern for Exascale, another key challenge is to minimize energy consumption, which we address in the second part of the thesis. The principal reason is that Exascale systems are expected to have requirements similar to those of current Petascale components, but at an extreme-scale capacity, namely 1000 times today's capacity! In order to bring the necessary power to these systems, the thermal power per $cm^2$ needs to be reduced drastically. One of the most power-consuming components of today's systems is the processor: even when idle, it dissipates a significant fraction of the total power. However, for future Exascale systems, the power dissipated to execute I/O transfers is likely to play an even more important role, because the relative cost of communications is expected to increase dramatically, both in terms of latency and of consumed energy.

The speed-scaling technique, known as dynamic voltage and frequency scaling (DVFS), consists in lowering the voltage of the processor, hence its execution speed. It is commonly assumed that the voltage is proportional to the frequency, which implies a dynamic power that is cubic in the frequency (illustrated below). Unfortunately, it has also been pointed out that DVFS increases the probability of failures exponentially, and that this probability cannot be neglected in a large-scale computing framework. In this context, we consider the speed-scaling technique coupled with reliability-increasing techniques such as re-execution, replication, and checkpointing. For these different problems, we propose various algorithms whose efficiency is demonstrated either through extensive simulations or through approximation guarantees relative to the optimal solution. Finally, we consider the different energy costs involved in periodic coordinated checkpointing, and compute the optimal period that minimizes energy consumption, as we did for execution time.
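To make the speed/energy trade-off concrete, here is a minimal illustrative sketch; it is not code from the thesis, and the normalized model it assumes (unit work, a hypothetical constant static power p_static, dynamic power equal to the cube of the speed) is chosen purely for illustration:

    # Minimal sketch (illustrative assumptions, not the thesis's model):
    # a job of `work` units runs at speed `speed` in time work/speed;
    # dynamic power is cubic in the speed, and a constant static power
    # is dissipated for the whole execution. All quantities normalized.

    def energy(work, speed, p_static):
        """Total energy = (static power + dynamic power) * execution time."""
        time = work / speed          # slower speed means longer execution
        p_dyn = speed ** 3           # dynamic power, cubic in the frequency
        return (p_static + p_dyn) * time

    # Slowing down shrinks the dynamic energy (work * speed**2) but
    # stretches the execution, so static energy grows; setting dE/ds = 0
    # gives the energy-optimal speed s_opt = (p_static / 2) ** (1/3).
    p_static = 0.5
    s_opt = (p_static / 2) ** (1 / 3)
    for s in (0.5, s_opt, 1.0):
        print(f"speed {s:.3f}: energy {energy(1.0, s, p_static):.3f}")

This toy model already shows why speed scaling cannot be pushed arbitrarily far, even before reliability enters the picture; the thesis additionally accounts for the failures that DVFS induces, through re-execution, replication, and checkpointing.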

https://hal.inria.fr/tel-01075111
Contributor: Equipe Roma
Submitted on: Thursday, October 16, 2014 - 4:06:31 PM
Last modification on: Wednesday, November 20, 2019 - 3:17:52 AM
Long-term archiving on: Saturday, January 17, 2015 - 10:50:12 AM

Licence

Distributed under a Creative Commons Attribution 4.0 International License

Identifiers

  • HAL Id: tel-01075111, version 1

Citation

Guillaume Aupy. Resilient and energy-efficient scheduling algorithms at scale. Data Structures and Algorithms [cs.DS]. École Normale Supérieure de Lyon, 2014. English. ⟨NNT : 2014ENSL0928⟩. ⟨tel-01075111⟩
