G. Aupy, A. Benoit, H. Casanova, and Y. Robert, Scheduling Computational Workflows on Failure-Prone Platforms, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015.
DOI : 10.1109/IPDPSW.2015.33

URL : https://hal.archives-ouvertes.fr/hal-01075100

G. Aupy, A. Benoit, T. Hérault, Y. Robert, F. Vivien et al., On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.11-20
DOI : 10.1109/PRDC.2013.10

URL : https://hal.archives-ouvertes.fr/hal-00836871

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010

URL : https://hal.archives-ouvertes.fr/hal-00788313

A. Avizienis, J. Laprie, B. Randell, and C. E. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.1, pp.11-33, 2004.
DOI : 10.1109/TDSC.2004.2

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.219.5446

P. Balaprakash, L. A. Gomez, M. Bouguerra, S. M. Wild, F. Cappello et al., Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing, Proc. PMBS'14, 2014.
DOI : 10.1007/978-3-319-17248-4_13

N. Bansal, T. Kimbrel, and K. Pruhs, Speed scaling to manage energy and temperature, Journal of the ACM, vol.54, issue.1, pp.1-39, 2007.
DOI : 10.1145/1206035.1206038

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.550.7426

L. Bautista-Gomez and F. Cappello, Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation, Proceedings of the 1st International Workshop on Fault Tolerant Systems, FTS'15, 2015.
DOI : 10.1109/cluster.2015.108

L. Bautista-Gomez and F. Cappello, Detecting Silent Data Corruption Through Data Dynamic Monitoring for Scientific Applications, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pp.381-382, 2014.

L. Bautista-Gomez and F. Cappello, Detecting Silent Data Corruption Through Data Dynamic Monitoring for Scientific Applications, In: SIGPLAN Notices, vol.49, issue.8, pp.381-382, 2014.

L. Bautista-Gomez and F. Cappello, Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption, Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, HPCC'15, p.206, 2015.

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427

URL : https://hal.archives-ouvertes.fr/hal-01298430

A. Benoit, S. K. Raina, and Y. Robert, Efficient checkpoint/verification patterns, The International Journal of High Performance Computing Applications, vol.31, issue.1, pp.52-65, 2017.

URL : https://hal.archives-ouvertes.fr/ensl-01252342

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, The International Journal of High Performance Computing Applications, 2014.
DOI : 10.1177/1094342014532297

URL : http://arxiv.org/pdf/1312.2674

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015.
DOI : 10.1145/1810085.1810120

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

G. Bosilca, Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
DOI : 10.1002/cpe.3173

URL : https://hal.archives-ouvertes.fr/hal-00908447

S. Boyd and L. Vandenberghe, Convex Optimization, 2004.

P. G. Bridges, K. B. Ferreira, M. A. Heroux, and M. Hoemmen, Fault-tolerant iterative methods via selective reliability, ArXiv e-prints, 2012.

A. G. Bromley, Charles Babbage's Analytical Engine, 1838, In: IEEE Annals of the History of Computing, vol.4, issue.3, pp.196-217, 1982.
DOI : 10.1109/mahc.1982.10028

URL : http://athena.union.edu/~hemmendd/courses/cs80/an-engine.pdf

G. Bronevetsky and B. de Supinski, Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008.
DOI : 10.1145/1375527.1375552

D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson, P. N. Kudva et al., Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors, IEEE Micro, vol.20, issue.6, pp.26-44, 2000.

F. Cappello, E. M. Constantinescu, P. D. Hovland, T. Peterka, C. Phillips et al., Improving the trust in results of numerical simulations and scientific data analytics, 2015.
DOI : 10.2172/1179023

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1515/9781400882618-003

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.7068

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.29, issue.2, 2014.
DOI : 10.1515/9781400882618-003

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.7068

H. Casanova, M. Bougeret, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, In: Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00668016

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015.
DOI : 10.1016/j.future.2015.04.003

URL : https://hal.archives-ouvertes.fr/hal-01199752

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

J. Chen and C. Kuo, Energy-Efficient Scheduling for Real-Time Systems on Dynamic Voltage Scaling (DVS) Platforms, 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2007), 2007.
DOI : 10.1109/RTCSA.2007.37

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 2013, pp.167-176

E. Ciocca, I. Koren, Z. Koren, C. M. Krishna, and D. S. Katz, Application-level fault tolerance in the orbital thermal imaging spectrometer, 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.
DOI : 10.1109/PRDC.2004.1276551

E. Ciocca, I. Koren, and C. M. Krishna, Determining acceptance tests for application-level fault detection, Proceedings of the 2nd ASPLOS EASY Workshop, pp.47-53, 2002.

R. Cohen and L. Katzir, The Generalized Maximum Coverage Problem, Information Processing Letters, vol.108, issue.1, pp.15-22, 2008.
DOI : 10.1016/j.ipl.2008.03.017

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.7439

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2001.

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming Models and Development Software for a Space-Based Many-Core Processor, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology, pp.95-102, 2011.
DOI : 10.1109/SMC-IT.2011.29

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of multilevel checkpoint model for large scale HPC applications, Proc. IPDPS'14, 2014.

S. Di and F. Cappello, Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications, IEEE Transactions on Parallel and Distributed Systems, vol.27, issue.10, 2016.
DOI : 10.1109/TPDS.2016.2517639

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, 2016.
DOI : 10.1109/TPDS.2016.2546248

URL : https://hal.archives-ouvertes.fr/hal-01263879

J. Dongarra, The International Exascale Software Project roadmap, The International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011.
DOI : 10.2172/471364

URL : http://www.exascale.org/mediawiki/images/2/20/IESP-roadmap.pdf

J. Dongarra, The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, The International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1016/S0167-8191(00)00087-9

J. Dongarra, T. Hérault, and Y. Robert, Performance and reliability trade-offs for the double checkpointing algorithm, International Journal of Networking and Computing, vol.4, issue.1, pp.23-41, 2014.
DOI : 10.15803/ijnc.4.1_23

URL : https://hal.archives-ouvertes.fr/hal-01091928

M. Dow, Explicit inverses of Toeplitz and associated matrices, ANZIAM Journal, vol.44, pp.185-215, 2003.
DOI : 10.21914/anziamj.v44i0.493

J. Elliott, M. Hoemmen, and F. Mueller, Evaluating the Impact of SDC on the GMRES Iterative Solver, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp.1193-1202, 2014.
DOI : 10.1109/IPDPS.2014.123

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626
DOI : 10.1109/ICDCS.2012.56

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.228.2542

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

E. N. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, In: PDCN. IASTED, 2009.

C. Engelmann and B. Swen, Redundant Execution of HPC Applications with MR-MPI, Parallel and Distributed Computing and Networks, 2011.
DOI : 10.2316/P.2011.719-031

D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin et al., Razor: circuit-level correction of timing errors for low-power operation, IEEE Micro, vol.24, issue.6, pp.10-20, 2004.
DOI : 10.1109/MM.2004.85

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011.
DOI : 10.1145/2063384.2063443

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12, p.78, 2012.

F. Firouzi, M. E. Salehi, F. Wang, and S. M. Fakhraie, An accurate model for soft error rate estimation considering dynamic voltage and frequency scaling effects, Microelectronics Reliability, vol.51, issue.2, pp.460-467, 2011.
DOI : 10.1016/j.microrel.2010.08.016

R. G. Gallager, Stochastic Processes: Theory for Applications, 2014.

M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979.

R. Ge, X. Feng, and K. W. Cameron, Performance-constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters, ACM/IEEE SC 2005 Conference (SC'05), p.34, 2005.
DOI : 10.1109/SC.2005.57

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.1233

A. Geist, How to kill a supercomputer: Dirty power, cosmic rays, and bad solder, In: IEEE Spectrum, 2016.

C. George and S. S. Vadhiyar, ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.
DOI : 10.1016/j.procs.2012.04.018

URL : http://doi.org/10.1016/j.procs.2012.04.018

D. Hakkarinen and Z. Chen, Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013.
DOI : 10.1109/TC.2012.17

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063444

K. Huang and J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Comput., vol.33, issue.6, pp.518-528, 1984.

A. A. Hwang, I. A. Stefanovici, and B. Schroeder, Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design, In: SIGARCH Comput. Archit. News, vol.40, issue.1, pp.111-122, 2012.

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, 2010.
DOI : 10.1109/ICPP.2010.80

G. Karakonstantis and K. Roy, Voltage over-scaling: A cross-layer design perspective for energy efficient systems, 2011 20th European Conference on Circuit Theory and Design (ECCTD), pp.548-551
DOI : 10.1109/ECCTD.2011.6043592

H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems, 2004.
DOI : 10.1007/978-3-540-24777-7

P. Krause and I. Polian, Adaptive voltage over-scaling for resilient applications, 2011 Design, Automation & Test in Europe, pp.1-6, 2011.
DOI : 10.1109/DATE.2011.5763153

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, In: 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, pp.49-56
DOI : 10.1145/2465813.2465821

R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson et al., Top ten exascale research challenges, In: DOE ASCAC subcommittee report, pp.1-86, 2014.

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

E. Meneses, X. Ni, T. Jones, and D. Maxwell, Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer, p.4, 2015.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM/IEEE SC Conf., pp.1-11, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266

T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994.

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), 2007.
DOI : 10.1109/MSST.2007.4367962

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.4662

F. Quaglia, A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.4, pp.346-362, 2001.
DOI : 10.1109/71.920586

S. Ramasubramanian, S. Venkataramani, A. Parandhaman, and A. Raghunathan, Relax-and-Retime: A methodology for energy-efficient recovery based design, Design Automation Conference (DAC), pp.1-6, 2013.

A. Randall, The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail - and didn't, In: Computerworld, vol.40, issue.8, p.18, 2006.

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, 2008 IEEE 14th International Symposium on High Performance Computer Architecture, pp.393-404, 2008.
DOI : 10.1109/HPCA.2008.4658655

N. B. Rizvandi, A. Y. Zomaya, Y. C. Lee, A. J. Boloori, and J. Taheri, Multiple Frequency Selection in DVFS-Enabled Processors to Minimize Energy Consumption, 2012.
DOI : 10.1109/MM.2005.70

URL : http://arxiv.org/abs/1203.5160

R. Rojas and U. Hashagen, The First Computers: History and Architectures, History of Computing, ISBN 9780262681377, 2002.

P. Sao and R. Vuduc, Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013.
DOI : 10.1145/2530268.2530272

B. Schroeder and G. A. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, 2007.
DOI : 10.1088/1742-6596/78/1/012022

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.69.2659

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78
DOI : 10.1145/2304576.2304588

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, pp.198-202, 1998.
DOI : 10.1049/ip-sen:19982440

URL : http://estudogeral.sib.uc.pt/jspui/bitstream/10316/12927/1/Using%20two-level%20stable%20storage.pdf

M. Snir, Addressing failures in exascale computing, The International Journal of High Performance Computing Applications, vol.28, issue.2, pp.129-173, 2014.

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.640.2201

V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley et al., Memory Errors in Modern Systems, ACM SIGPLAN Notices, vol.50, issue.4, pp.297-310
DOI : 10.1145/1815961.1815973

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264669

S. Toueg and Ö. Babaoğlu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, 1984.
DOI : 10.1137/0213039

URL : http://ecommons.cornell.edu/bitstream/1813/6386/1/83-546.pdf

N. H. Vaidya, A Case for Two-level Distributed Recovery Schemes, In: SIGMETRICS Perform. Eval. Rev., vol.23, issue.1, pp.64-73, 1995.

L. Wang, G. von Laszewski, J. Dayal, and F. Wang, Towards Energy Aware Scheduling for Precedence Constrained Parallel Tasks in a Cluster with DVFS, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.19

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140

URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, In: ICCIT. IEEE, 2011.

G. Zheng, L. Shi, and L. V. Kale, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

Z. Zheng, L. Yu, and Z. Lan, Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart, IEEE Transactions on Computers, vol.64, issue.5, pp.1402-1415, 2015.
DOI : 10.1109/TC.2014.2317182

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM Experiments in Soft Fails in Computer Electronics, In: IBM J. Res. Dev., vol.40, issue.1, pp.3-18, 1996.
DOI : 10.1147/rd.401.0003

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.
DOI : 10.1109/4.658626

Book chapters

G. Aupy, A. Benoit, A. Cavelan, M. Fasi et al., Coping with silent errors in HPC applications, In: Emergent Computation, pp.269-292, 2016.

Articles in International Refereed Journals

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina et al., Coping with recall and precision of soft error detectors, pp.8-24, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01246639

A. Benoit, A. Cavelan, V. Le Fèvre, Y. Robert, and H. Sun, Towards Optimal Multi-Level Checkpointing, IEEE Transactions on Computers, vol.66, issue.7, 2016.
DOI : 10.1109/TC.2016.2643660

URL : https://hal.archives-ouvertes.fr/hal-01339788

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, In: ACM Transactions on Parallel Computing, vol.3, issue.2, p.13, 2016.
DOI : 10.1007/978-3-319-17248-4_11

URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Multi-level checkpointing and silent error detection for linear workflows, Journal of Computational Science, 2017.
DOI : 10.1016/j.jocs.2017.03.024

URL : https://hal.archives-ouvertes.fr/hal-01363581

Articles in International Refereed Conferences

L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina et al., Which verification for soft error detection, International Conference on High Performance Computing (HiPC), IEEE, pp.2-11, 2015.
DOI : 10.1109/hipc.2015.26

URL : https://hal.archives-ouvertes.fr/hal-01252382

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.202-211
DOI : 10.1109/IPDPS.2016.39

URL : https://hal.archives-ouvertes.fr/hal-01354886

A. Cavelan, J. Li, Y. Robert, and H. Sun, When Amdahl Meets Young/Daly, 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp.203-212
DOI : 10.1109/CLUSTER.2016.17

URL : https://hal.archives-ouvertes.fr/hal-01355963

A. Cavelan, S. K. Raina, Y. Robert, and H. Sun, Assessing the impact of partial verifications against silent data corruptions, International Conference on Parallel Processing, pp.440-449, 2015.

A. Cavelan, Y. Robert, H. Sun, and F. Vivien, Scheduling Independent Tasks with Voltage Overscaling, Pacific Rim International Symposium on Dependable Computing (PRDC), IEEE, 2015.

A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale, Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS '17, 2017.

URL : https://hal.archives-ouvertes.fr/hal-01494678

A. Benoit, A. Cavelan, V. Le Fèvre, and Y. Robert, Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms, Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS '17, 2017.

URL : https://hal.archives-ouvertes.fr/hal-01504936

A. Benoit, A. Cavelan, V. Le Fèvre, Y. Robert, and H. Sun, A Different Re-execution Speed Can Help, 2016 45th International Conference on Parallel Processing Workshops (ICPPW)
DOI : 10.1109/ICPPW.2016.45

URL : https://hal.archives-ouvertes.fr/hal-01354887

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors, 7th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2014.
DOI : 10.1007/978-3-319-17248-4_11

URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Two-Level Checkpointing and Verifications for Linear Task Graphs, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.1239-1248
DOI : 10.1109/IPDPSW.2016.106

URL : https://hal.archives-ouvertes.fr/hal-01252400

A. Cavelan, Y. Robert, H. Sun, and F. Vivien, Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors, Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS '15, pp.27-34
DOI : 10.1109/JETCAS.2011.2135550

URL : https://hal.archives-ouvertes.fr/hal-01121065