
Programmation des systèmes parallèles distribués : tolérance aux pannes, résilience et adaptabilité

Abstract : Grid and cluster architectures are gaining in popularity for scientific computing applications. Distributed computations, together with their underlying infrastructure of many computers, storage devices and networking devices, make it challenging to overcome the effects of node failures. This work presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments, as found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented: Systematic Event Logging (SEL) and Theft-Induced Checkpointing (TIC). Both are efficient and extremely flexible under the system-state model, allowing recovery on different platforms and with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large platform. Both methods are shown to have very small overhead, and the trade-off between checkpointing cost and recovery cost can be controlled.
Contributor : Samir Jafar
Submitted on : Friday, July 14, 2006 - 12:27:23 PM
Last modification on : Friday, November 6, 2020 - 4:39:43 AM
Long-term archiving on : Tuesday, September 18, 2012 - 4:11:31 PM


  • HAL Id : tel-00085169, version 1



Samir Jafar. Programmation des systèmes parallèles distribués : tolérance aux pannes, résilience et adaptabilité. Réseaux et télécommunications [cs.NI]. Institut National Polytechnique de Grenoble - INPG, 2006. Français. ⟨tel-00085169⟩


