A. Agbaria, R. J. Friedman, A. Russell, A. Lowry et al., Virtual machine based heterogeneous checkpointing Optimistic failure recovery for very large networks, pp.66-75, 1991.
DOI : 10.1002/spe.478

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.112.3200

L. Alvisi and K. Marzullo, Message logging: pessimistic, optimistic, causal, and optimal, IEEE Transactions on Software Engineering, vol.24, issue.2, pp.149-159, 1998.
DOI : 10.1109/32.666828

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.78

L. Alvisi, S. Rao, S. Husain, A. Mel, and E. Elnozahy, An analysis of communication induced checkpointing, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352), p.242, 1999.
DOI : 10.1109/FTCS.1999.781058

L. Alvisi, K. Bhatia, and K. Marzullo, Causality tracking in causal message-logging protocols, Distributed Computing, pp.1-15, 2002.

A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.1, pp.11-33, 2004.
DOI : 10.1109/TDSC.2004.2

L. Baduel, F. Baude, and D. Caromel, Efficient, Flexible, and Typed Group Communications in Java, Joint ACM Java Grande - ISCOPE Conference, pp.28-36.

D. Bailey, T. Harris et al., The NAS parallel benchmarks 2, 1995.

R. Baldoni, A. Mostefaoui, J. Brzezinski, J. Hélary, and M. Raynal, Characterization of consistent global checkpoints in large-scale distributed systems, Proceedings of the Fifth IEEE Computer Society Workshop on Future Trends of Distributed Computing Systems, p.314, 1995.
DOI : 10.1109/FTDCS.1995.525000

R. Baldoni, F. Quaglia, and B. Ciciani, A VP-accordant checkpointing protocol preventing useless checkpoints, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281), p.61, 1998.
DOI : 10.1109/RELDIS.1998.740475

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.9572

R. Baldoni, F. Quaglia, P. R. Fornara, M. Baldoni, . F. Raynal et al., An index-based checkpointing algorithm for autonomous distributed systems, Euro- Par CAROMEL, C. DELBÉ, and L. HENRIO. Un protocole de tolérance aux pannes pour objets actifs non préemptifs. Technique et Science Informatiques, pp.181-192, 1999.
DOI : 10.1109/71.752783

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.1960

W. Bolosky, J. Douceur, D. Ely, and M. Theimer, Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs, ACM international conference on Measurement and modeling of computer systems, SIGMETRICS, pp.34-43, 2000.

G. Bosilca et al., MPICH-V : toward a scalable fault tolerant MPI for volatile nodes, Supercomputing '02 : Proceedings of the 2002 ACM/IEEE conference on Supercomputing.

S. Bouchenak and D. Hagimont, Pickling threads state in the Java system, Third European Research Seminar on Advances in Distributed Systems (ERSADS'99), 1999.

A. Bouteiller, F. Cappello, T. Herault et al., MPICH-V2 : a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging, SC '03 : Proceedings of the 2003 ACM/IEEE conference on Supercomputing.

MPICH-V3 : A hierarchical fault tolerant MPI for multi-cluster grids, IEEE/ACM SC 2003.

S. Bouchenak, D. Hagimont, S. Krakowiak, N. De, F. Palma et al., Experiences implementing efficient Java thread serialization, mobility and persistence, IPDPS '05 : Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) -Papers, page 97 IEEE International Symposium on Reliability, Distributed Software, and Databases, pp.355-393, 1984.
DOI : 10.1002/spe.569

URL : https://hal.archives-ouvertes.fr/inria-00071923

G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill, C3 : A system for automating application-level checkpointing of mpi programs, The 16th International Workshop on Languages and Compilers for Parallel Computers, 2003.

G. Bronevetsky, R. Fernandes, D. Marques, K. Pingali, and P. Stodghill, Recent advances in checkpoint/recovery systems, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, pp.25-29, 2006.
DOI : 10.1109/IPDPS.2006.1639575

P. Bungale, S. Sridhar, and V. Krishnamurthy, A poll-free, low-latency approach to process state capture/recovery in heterogeneous computing systems, Networks, Parallel and Distributed Processing, and Applications, 2002.

J. Busca, F. Picconi, and P. Sens, Pastis: A Highly-Scalable Multi-user Peer-to-Peer File System, Euro-Par, pp.1173-1182, 2005.
DOI : 10.1007/11549468_128

URL : https://hal.archives-ouvertes.fr/inria-00070712

J. Bustos-Jimenez, D. Caromel, A. Costanzo, and J. Piquer, Balancing Active Objects on a Peer to Peer Infrastructure, XXV International Conference of the Chilean Computer Science Society (SCCC'05), p.109, 2005.
DOI : 10.1109/SCCC.2005.1587872

URL : https://hal.archives-ouvertes.fr/inria-00001237

F. Cappello, E. Caron, M. Dayde, F. Desprez, E. Jeannot et al., Grid'5000 : a large scale, reconfigurable, controlable and monitorable Grid platform, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00000284

D. Caromel, F. Huet, and J. Vayssière, A simple security-aware MOP for Java, REFLECTION '01 : Proceedings of the Third International Conference on Metalevel Architectures and Separation of Crosscutting Concerns, pp.118-125, 2001.
DOI : 10.1007/3-540-45429-2_9

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.7823

D. Caromel, C. Delbé, and L. Henrio, A fault-tolerance protocol for ASP calculus : Design and proof, 2004.
URL : https://hal.archives-ouvertes.fr/inria-00070752

D. Caromel, L. Henrio, and B. Serpette, Asynchronous and deterministic objects, POPL '04 : Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp.123-134, 2004.
DOI : 10.1145/982962.964012

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.9095

D. Caromel and L. Henrio, A Theory of Distributed Objects : Asynchrony - Mobility - Groups - Components, 2005.

D. Caromel, A. Costanzo, and C. Mathieu, Peer-to-peer for computational grids : Mixing clusters and desktop machines.

D. Caromel, C. Delbé, and A. Costanzo, Peer-to-peer and fault-tolerance : Towards deployment based technical services, Second CoreGRID Workshop on Grid and Peer to Peer Systems Architecture, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00001238

D. Caromel, C. Delbé, A. Costanzo, and M. Leyton, ProActive: an integrated platform for programming and running applications on Grids and P2P systems, Computational Methods in Science and Technology, vol.12, issue.1, 2006.
DOI : 10.12921/cmst.2006.12.01.69-77

URL : https://hal.archives-ouvertes.fr/hal-00125034

D. Caromel, C. Delbé, and L. Henrio, Promised consistency for rollback recovery, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00071365

K. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

T. Chandra and S. Toueg, Unreliable failure detectors for reliable distributed systems, Journal of the ACM, vol.43, issue.2, pp.225-267, 1996.
DOI : 10.1145/226643.226647

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.113.498

S. Chiba, A metaobject protocol for C++, ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA'95), SIGPLAN Notices, pp.285-299, 1995.

S. Choi and S. Deitz, Compiler support for automatic checkpointing, HPCS '02 : Proceedings of the 16th Annual International Symposium on High Performance Computing Systems and Applications, p.213, 2002.

D. Conan and G. Bernard, La reprise sur erreur par recouvrement arrière automatique dans les systèmes répartis, Parallélisme et répartitions (coll. Parallélisme, réseaux et répartition) Ed. J.F. Myoupo, Hermès, 1998.

C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita et al., Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI, ACM/IEEE SC 2006 Conference (SC'06), 2006.
DOI : 10.1109/SC.2006.15

URL : https://hal.archives-ouvertes.fr/hal-00684891

O. Damani, A. Tarafdar, and V. Garg, Optimistic recovery in multi-threaded distributed systems, Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems, p.234, 1999.
DOI : 10.1109/RELDIS.1999.805099

E. Elnozahy, D. Johnson, and W. Zwaenepoel, The Performance of Consistent Checkpointing, Proceedings of the 11th IEEE Symposium on Reliable Distributed Systems, 1992.

E. Elnozahy and W. Zwaenepoel, Manetho : Transparent rollback-recovery with low overhead, limited rollback and fast output, pp.526-531.

E. Elnozahy et al., A survey of rollback-recovery protocols in message-passing systems, p.375.

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15

T. Enokido, H. Higaki, and M. Takizawa, Significant message precedence in object-based systems, Proceedings 1998 International Conference on Parallel and Distributed Systems (Cat. No.98TB100250), p.284, 1998.
DOI : 10.1109/ICPADS.1998.741083

B. Folliot and P. Sens, GatoStar: A fault tolerant load sharing facility for parallel applications, EDCC-1 : Proceedings of the First European Dependable Computing Conference on Dependable Computing, pp.581-598, 1994.
DOI : 10.1007/3-540-58426-9_159

I. Foster and C. Kesselman, The Grid : Blueprint for a new Computing Infrastructure, 1999.

A. Ganesh, A. Kermarrec, L. R. Massouli, J. Gioiosa, S. Sancho et al., Peer-to-peer membership management for gossip-based protocols, SC '05 : Proceedings of the 2005 ACM/IEEE conference on Supercomputing, pp.139-149, 2003.
DOI : 10.1109/TC.2003.1176982

J. Howell, Straightforward java persistence through checkpointing, Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and Proceedings of the 3rd International Workshop on Persistence and Java (PJW3), pp.322-334, 1999.

J. Hélary, A. Mostéfaoui, and M. Raynal, Preventing useless checkpoints in distributed computations, SRDS '97 : Proceedings of the 16th Symposium on Reliable Distributed Systems (SRDS '97), p.183, 1997.

J. Hélary, A. Mostéfaoui, and M. Raynal, Virtual precedence in asynchronous systems : Concept and applications, WDAG '97 : Proceedings of the 11th International Workshop on Distributed Algorithms, pp.170-184, 1997.

J. Hélary, A. Mostefaoui, and M. Raynal, Communication-induced determination of consistent snapshots, IEEE Trans. Parallel Distrib. Syst, vol.10, issue.9, pp.865-877, 1999.

J. Hélary, R. Netzer, and M. Raynal, Consistency issues in distributed checkpoints, IEEE Trans. Softw. Eng, vol.25, issue.2, pp.274-281, 1999.

J. Hélary, A. Mostefaoui, R. Netzer, and M. Raynal, Communication-based prevention of useless checkpoints in distributed computations, Distributed Computing, vol.13, issue.1, pp.29-43, 2000.
DOI : 10.1007/s004460050003

D. Johnson and W. Zwaenepoel, Sender-based message logging, The 17th annual international symposium on fault-tolerant computing, 1987.

D. Johnson and W. Zwaenepoel, Recovery in distributed systems using optimistic message logging and checkpointing, 7th Annual ACM Symposium on Principles of Distributed Computing, pp.171-181, 1988.
DOI : 10.1016/0196-6774(90)90022-7

D. Johnson, Efficient transparent optimistic rollback recovery for distributed application programs, Proceedings of 1993 IEEE 12th Symposium on Reliable Distributed Systems, 1993.
DOI : 10.1109/RELDIS.1993.393470

L. Kale and S. Krishnan, Charm++ : a portable concurrent object oriented system based on C++, OOPSLA '93 : Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications, pp.91-108, 1993.

G. Kiczales and E. Hilsdale, Aspect-oriented programming, ESEC/FSE-9 : Proceedings of the 8th European software engineering conference held jointly with 9th ACM SIGSOFT international symposium on Foundations of software engineering, p.313, 2001.

M. Killijian, J. Ruiz-Garcia, and J. Fabre, Using Compile-Time Reflection for Objects' State Capture, Lecture Notes in Computer Science, vol.1616, pp.150-160, 1999.
DOI : 10.1007/3-540-48443-4_15

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.5788

L. Lamport, Time, clocks, and the ordering of events in a distributed system, Communications of the ACM, vol.21, pp.558-565, 1978.

L. Lamport and M. Massa, Cheap Paxos, International Conference on Dependable Systems and Networks, 2004, p.307, 2004.
DOI : 10.1109/DSN.2004.1311900

P. Lemarinier, A. Bouteiller, T. Herault, G. Krawezik, and F. Cappello, Improved message logging versus improved coordinated checkpointing for fault tolerant MPI, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935), pp.115-124, 2004.
DOI : 10.1109/CLUSTR.2004.1392609

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.8108

H. Leong and D. Agrawal, Using message semantics to reduce rollback in optimistic message logging recovery schemes, 14th International Conference on Distributed Computing Systems, pp.227-234, 1994.

M. Lewis, A. Ferrari, M. Humphrey, J. Karpovich, M. Morgan et al., Support for extensibility and site autonomy in the Legion grid system object model, Journal of Parallel and Distributed Computing, vol.63, issue.5, pp.525-538, 2003.
DOI : 10.1016/S0743-7315(03)00012-1

C. Lin, S. Wang, and S. Kuo, An efficient time-based checkpointing protocol for mobile computing systems over Mobile IP, Mobile Networks and Applications, vol.8, issue.6, pp.687-697, 2003.
DOI : 10.1023/A:1026086712672

D. Manivannan and M. Singhal, A low-overhead recovery technique using quasi-synchronous checkpointing, Proceedings of 16th International Conference on Distributed Computing Systems, p.100, 1996.
DOI : 10.1109/ICDCS.1996.507906

D. Manivannan and M. Singhal, Quasi-synchronous checkpointing: Models, characterization, and classification, IEEE Transactions on Parallel and Distributed Systems, vol.10, issue.7, pp.703-713, 1999.
DOI : 10.1109/71.780865

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.6623

D. Manivannan and M. Singhal, Asynchronous recovery without using vector timestamps, Journal of Parallel and Distributed Computing, vol.62, issue.12, pp.1695-1728, 2002.
DOI : 10.1016/S0743-7315(02)00005-9

O. Marin, M. Bertier, and P. Sens, DARX - a framework for the fault-tolerant support of agent software, ISSRE, p.406, 2003.

N. Mittal and V. Garg, Debugging distributed programs using controlled re-execution, Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing, PODC '00, pp.239-248, 2000.
DOI : 10.1145/343477.343624

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.1976

A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, Ivy, Proceedings of 5th Symposium on Operating Systems Design and Implementation, 2002.
DOI : 10.1145/844128.844132

H. Nakamura, T. Hayashida, M. Kondo, Y. Tajima, M. Imai et al., Skewed checkpointing for tolerating multi-node failures, Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, pp.116-125, 2004.
DOI : 10.1109/RELDIS.2004.1353012

R. Netzer and J. Xu, Necessary and sufficient conditions for consistent global snapshots, IEEE Transactions on Parallel and Distributed Systems, vol.6, issue.2, pp.165-169, 1995.
DOI : 10.1109/71.342127

R. Nieuwpoort, J. Maassen, G. Wrzesińska, R. Hofman, C. Jacobs et al., Ibis, Proceedings of the 2002 joint ACM-ISCOPE conference on Java Grande, JGI '02, pp.1079-1107, 2005.
DOI : 10.1145/583810.583813

J. Plank, Efficient Checkpointing on MIMD Architectures, 1993.

J. Plank, M. Beck, G. Kingsley, and K. Li, Libckpt : Transparent Checkpointing under Unix, Proceedings of USENIX Winter 1995 Technical Conference, pp.213-224, 1995.

J. Plank, J. Xu, and R. Netzer, Compressed differences : An algorithm for fast incremental checkpointing, 1995.

J. S. Plank, K. Li, and M. A. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.

S. Rao, L. Alvisi, and H. Vin, The cost of recovery in message logging protocols, IEEE Transactions on Knowledge and Data Engineering, vol.12, pp.160-173, 2000.

S. Sankaran, J. Squyres et al., The LAM/MPI checkpoint/restart framework : System-initiated checkpointing, International Journal of High Performance Computing Applications, vol.19, issue.4, p.479.

M. Schulz, G. Bronevetsky, R. Fernandes, D. Marques, K. Pingali et al., Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs, Proceedings of the ACM/IEEE SC2004 Conference, p.38, 2004.
DOI : 10.1109/SC.2004.29

T. Sekiguchi, H. Masuhara, and A. Yonezawa, A Simple Extension of Java Language for Controllable Transparent Migration and its Portable Implementation, COORDINATION '99 : Proceedings of the Third International Conference on Coordination Languages and Models, pp.211-226, 1999.
DOI : 10.1007/3-540-48919-3_16

P. Sens, The performance of independent checkpointing in distributed systems, Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences, p.525, 1995.
DOI : 10.1109/HICSS.1995.375504

P. Sens, Contribution à l'intégration de la tolérance aux fautes dans les environnements répartis, Thèse d'Habilitation de l, 2000.

L. Silva and J. Silva, System-level versus user-defined checkpointing, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281), p.68, 1998.
DOI : 10.1109/RELDIS.1998.740476

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.6984

L. Silva and J. Silva, The performance of coordinated and independent checkpointing, Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999, pp.280-284, 1999.
DOI : 10.1109/IPPS.1999.760487

L. Silva and J. Silva, Using message semantics for fast-output commit in checkpointing-and-rollback recovery, Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers, 1999.
DOI : 10.1109/HICSS.1999.772986

A. Sistla and J. Welch, Efficient distributed recovery using message logging, Proceedings of the eighth annual ACM Symposium on Principles of distributed computing, PODC '89, pp.223-238, 1989.
DOI : 10.1145/72981.72997

G. Stellner, CoCheck : Checkpointing and process migration for MPI, IPPS '96 : Proceedings of the 10th International Parallel Processing Symposium, pp.526-531, 1996.

R. Strom and S. Yemini, Optimistic recovery in distributed systems, ACM Transactions on Computer Systems, vol.3, issue.3, pp.204-226, 1985.
DOI : 10.1145/3959.3962

R. Strom, D. Bacon, and S. Yemini, Volatile logging in n-fault-tolerant distributed systems, Proc IEEE Fault-tolerant Computing Symposium, pp.44-49, 1988.

V. Strumpen, Compiler technology for portable checkpoints.

T. Tannenbaum and M. Litzkow, Checkpointing and migration of UNIX processes in the Condor distributed processing system, Dr Dobbs Journal, 1995.

A. Tarafdar and V. Garg, Addressing false causality while detecting predicates in distributed programs, 18th International Conference on Distributed Computing Systems, pp.94-101, 1998.

A. Tarafdar and V. Garg, Happened before is the wrong model for potential causality.

M. Tatsubori, S. Chiba, K. Itano, and M. Killijian, OpenJava: A Class-Based Macro System for Java, Reflection and Software Engineering, pp.117-133, 1999.
DOI : 10.1007/3-540-45046-7_7

D. Thain, T. Tannenbaum, and M. Livny, Distributed computing in practice : the Condor experience, Concurrency - Practice and Experience, pp.323-356.

E. Truyen, B. Robben, B. Vanhaute et al., Portable support for transparent thread migration in Java, 2000.

N. Vaidya, Staggered consistent checkpointing, IEEE Trans. Parallel Distrib. Syst, vol.10, issue.7, pp.694-702, 1999.

J. Waldo, G. Wyant, A. Wollrath, and S. Kendall, A note on distributed computing, 1994.
DOI : 10.1007/3-540-62852-5_6

Y. Wang, Y. Huang, and W. K. Fuchs, Progressive retry for software error recovery in distributed systems, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing, pp.138-144, 1993.
DOI : 10.1109/FTCS.1993.627317

F. Zambonelli, On the effectiveness of distributed checkpoint algorithms for domino-free recovery, Proceedings. The Seventh International Symposium on High Performance Distributed Computing (Cat. No.98TB100244), p.124, 1998.
DOI : 10.1109/HPDC.1998.709964

G. Zheng, L. Shi, and L. Kale, Ftc-charm++ : an in-memory checkpoint-based fault tolerant runtime for charm++ and mpi, CLUSTER, pp.93-103, 2004.