
Combining checkpointing and other resilience mechanisms for exascale systems

Dounia Bentria
Abstract : In this thesis, we are interested in scheduling and optimization problems in probabilistic contexts. The contributions of this thesis come in two parts. The first part is dedicated to the optimization of different fault-tolerance mechanisms for very large scale machines that are subject to a probability of failure. The second part is devoted to the optimization of the expected sensor data acquisition cost when evaluating a query expressed as a tree of disjunctive Boolean operators applied to Boolean predicates.

In the first chapter, we present the related work of the first part and then introduce some new general results that are useful for resilience on exascale systems.

In the second chapter, we study a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies. We propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated exascale designs.

In the third, fourth, and fifth chapters, we study the combination of different fault-tolerance mechanisms (replication, fault prediction, and detection of silent errors) with the traditional checkpoint/restart mechanism. We evaluate several models using simulations. Our results show that these models are useful for a range of application models in the context of future exascale systems.

In the second part of the thesis, we study the problem of minimizing the expected sensor data acquisition cost when evaluating a query expressed as a tree of disjunctive Boolean operators applied to Boolean predicates. The problem is to determine the order in which predicates should be evaluated so as to shortcut part of the query evaluation and minimize the expected cost.

In the sixth chapter, we present the related work of the second part, and in the seventh chapter, we study the problem for queries expressed in disjunctive normal form. We consider the more general case where each data stream can appear in multiple predicates, and we consider two models: one in which each predicate can access a single stream, and one in which each predicate can access multiple streams.
Document type :
Theses
Cited literature [121 references]

https://tel.archives-ouvertes.fr/tel-01127150
Contributor : Abes Star
Submitted on : Saturday, March 7, 2015 - 12:28:58 AM
Last modification on : Wednesday, November 20, 2019 - 3:27:16 AM
Long-term archiving on : Monday, June 8, 2015 - 10:56:34 AM

File

2014ENSL0971.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01127150, version 1

Citation

Dounia Bentria. Combining checkpointing and other resilience mechanisms for exascale systems. Other [cs.OH]. Ecole normale supérieure de lyon - ENS LYON, 2014. English. ⟨NNT : 2014ENSL0971⟩. ⟨tel-01127150⟩
