
Nouveaux Protocoles de Tolérances aux Fautes pour les Applications MPI du Calcul Haute Performance

Abstract : With the evolution of parallel computers, the need for fault tolerance protocols is becoming increasingly important. The existing fault tolerance protocols are not adapted to these architectures because they force either a global restart (coordinated checkpointing protocols) or the logging of all messages (message logging protocols). We studied the characteristics of the existing protocols. We first studied the determinism of the applications, since existing protocols assume non-deterministic or piecewise deterministic executions. In our study, we examined the message passing model, and more specifically MPI applications. We analyzed 26 MPI applications and put forward a new characteristic called "send-determinism", which corresponds to most of the studied applications. In a second step, we studied the communication patterns of the applications to look for clusters of processes in these patterns. The study showed that for most applications, it is possible to create clusters of processes that minimize both the size of the clusters and the volume of inter-cluster messages. Then we designed two fault tolerance protocols. The first one is an uncoordinated checkpointing protocol which is based on the send-deterministic assumption and avoids the domino effect while logging only a subset of the application messages. We also adapted this protocol to clusters of processes. Then, we proposed HydEE, a hierarchical protocol that is also based on the send-deterministic assumption and that is used on clusters of processes. It combines a coordinated checkpointing protocol inside clusters with a message logging protocol for inter-cluster messages.
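As an illustration of the hierarchical idea described in the abstract, the following minimal sketch assumes a fixed assignment of processes to clusters and treats a message as "to be logged" only when it crosses a cluster boundary. The names `must_log`, `logged_volume`, `cluster_of` and the sample message list are hypothetical and chosen for this example only; this is not the HydEE implementation, which the abstract does not detail.

```python
def must_log(sender, receiver, cluster_of):
    """Return True when a message crosses a cluster boundary.

    Assumption for this sketch: intra-cluster messages are covered by the
    coordinated checkpoint of the cluster, so only inter-cluster messages
    need to be logged.
    """
    return cluster_of[sender] != cluster_of[receiver]


def logged_volume(messages, cluster_of):
    """Sum the sizes (in bytes) of the messages that would be logged."""
    return sum(size for src, dst, size in messages
               if must_log(src, dst, cluster_of))


# Hypothetical example: 4 processes grouped into 2 clusters.
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}

# (sender rank, receiver rank, message size in bytes)
messages = [(0, 1, 100), (1, 2, 40), (2, 3, 100), (3, 0, 10)]

total = sum(size for _, _, size in messages)
logged = logged_volume(messages, cluster_of)
print(f"logged {logged} of {total} bytes")
```

With a clustering that keeps the heaviest exchanges inside clusters, only a small share of the total message volume falls in the logged category, which is the trade-off the abstract describes.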
Contributor : ABES STAR
Submitted on : Friday, February 3, 2012 - 2:32:40 PM
Last modification on : Sunday, June 26, 2022 - 11:55:26 AM
Long-term archiving on : Thursday, November 22, 2012 - 10:40:54 AM


Version validated by the jury (STAR)


  • HAL Id : tel-00666063, version 1



Amina Guermouche. Nouveaux Protocoles de Tolérances aux Fautes pour les Applications MPI du Calcul Haute Performance. Autre [cs.OH]. Université Paris Sud - Paris XI, 2011. Français. ⟨NNT : 2011PA112281⟩. ⟨tel-00666063⟩


