Tolerating Transient, Permanent, and Intermittent Failures

Swan Dubois 1, 2
2 Regal - Large-Scale Distributed Systems and Applications
LIP6 - Laboratoire d'Informatique de Paris 6, Inria Paris-Rocquencourt
Abstract : A distributed system is a set of autonomous computation units that communicate in order to solve a global task. This model is general enough to describe any kind of network (LAN, sensor network, ...). As a distributed system grows larger, or when it is deployed in a hazardous environment, the possibility that some of its elements are subject to faults (failure, memory corruption, hacking, ...) becomes impossible to ignore. Faults can be classified according to their duration, span, or nature. In this thesis, we focus on distributed systems that simultaneously tolerate several kinds of faults, using three classical problems as case studies. First, we present a distributed protocol that simulates a single-writer multi-reader atomic register in the presence of transient faults and permanent crash faults. This protocol relies on two reusable tools: a communication primitive and a bounded timestamp scheme. Then, we study weak synchronization of logical clocks in the presence of transient faults and intermittent Byzantine faults. We prove several impossibility results and provide a protocol that is optimal with respect to both these impossibility results and recovery time. Finally, we define three new fault tolerance schemes for distributed systems that are subject to transient faults and intermittent Byzantine faults. We design a protocol that constructs a wide class of spanning trees and is optimal with respect to the fault tolerance metrics defined for these three schemes.

Cited literature : 133 references

https://tel.archives-ouvertes.fr/tel-00663317
Contributor : Swan Dubois
Submitted on : Thursday, January 26, 2012 - 4:25:47 PM
Last modification on : Friday, March 22, 2019 - 1:31:49 AM
Long-term archiving on : Wednesday, December 14, 2016 - 2:10:01 AM

Identifiers

  • HAL Id : tel-00663317, version 1

Citation

Swan Dubois. Tolerating Transient, Permanent, and Intermittent Failures. Distributed, Parallel, and Cluster Computing [cs.DC]. Université Pierre et Marie Curie - Paris VI, 2011. English. ⟨tel-00663317⟩

Metrics

Record views : 497
File downloads : 576