Scheduling Computational Workflows on Failure-Prone Platforms, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015. ,
DOI : 10.1109/IPDPSW.2015.33
URL : https://hal.archives-ouvertes.fr/hal-01075100
On the Combination of Silent Error Detection and Checkpointing, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.11-20 ,
DOI : 10.1109/PRDC.2013.10
URL : https://hal.archives-ouvertes.fr/hal-00836871
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014. ,
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313
Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.1, pp.11-33, 2004. ,
DOI : 10.1109/TDSC.2004.2
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.219.5446
Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing, Proc. PMBS'14, 2014. ,
DOI : 10.1007/978-3-319-17248-4_13
Speed scaling to manage energy and temperature, Journal of the ACM, vol.54, issue.1, pp.1-39, 2007. ,
DOI : 10.1145/1206035.1206038
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.550.7426
Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation, Proceedings of the 1st International Workshop on Fault Tolerant Systems. FTS'15, 2015. ,
DOI : 10.1109/cluster.2015.108
Detecting Silent Data Corruption Through Data Dynamic Monitoring for Scientific Applications, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP '14, pp.381-382, 2014. ,
Detecting Silent Data Corruption Through Data Dynamic Monitoring for Scientific Applications, In: SIGPLAN Notices, vol.49, issue.8, pp.381-382, 2014. ,
Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption, Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications. HPCC'15, p.206, 2015. ,
FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-01298430
Efficient checkpoint/verification patterns, The International Journal of High Performance Computing Applications, vol.31, issue.1, pp.52-65, 2017. ,
URL : https://hal.archives-ouvertes.fr/ensl-01252342
Silent error detection in numerical timestepping schemes, The International Journal of High Performance Computing Applications, 2014. ,
DOI : 10.1177/1094342014532297
URL : http://arxiv.org/pdf/1312.2674
Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, 2015. ,
DOI : 10.1145/1810085.1810120
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013. ,
DOI : 10.1002/cpe.3173
URL : https://hal.archives-ouvertes.fr/hal-00908447
Convex Optimization, 2004. ,
Fault-tolerant iterative methods via selective reliability, ArXiv e-prints, 2012. ,
Charles Babbage's Analytical Engine, 1838, In: IEEE Annals of the History of Computing, vol.4, issue.3, pp.196-217, 1982. ,
DOI : 10.1109/mahc.1982.10028
URL : http://athena.union.edu/~hemmendd/courses/cs80/an-engine.pdf
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.155-164, 2008. ,
DOI : 10.1145/1375527.1375552
Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors, IEEE Micro, vol.20, issue.6, pp.26-44, 2000. ,
Improving the trust in results of numerical simulations and scientific data analytics, 2015. ,
DOI : 10.2172/1179023
Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.29, issue.2, pp.374-388, 2009. ,
DOI : 10.1515/9781400882618-003
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.7068
Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.29, issue.2, 2014. ,
DOI : 10.1515/9781400882618-003
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.7068
Using group replication for resilience on exascale systems, In: Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00668016
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015. ,
DOI : 10.1016/j.future.2015.04.003
URL : https://hal.archives-ouvertes.fr/hal-01199752
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985. ,
DOI : 10.1145/214451.214456
Energy-Efficient Scheduling for Real-Time Systems on Dynamic Voltage Scaling (DVS) Platforms, 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2007), 2007. ,
DOI : 10.1109/RTCSA.2007.37
Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 2013, pp.167-176 ,
Application-level fault tolerance in the orbital thermal imaging spectrometer, 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings. ,
DOI : 10.1109/PRDC.2004.1276551
Determining acceptance tests for application-level fault detection, Proceedings of the 2nd ASPLOS EASY Workshop, pp.47-53, 2002. ,
The Generalized Maximum Coverage Problem, Information Processing Letters, vol.108, issue.1, pp.15-22, 2008. ,
DOI : 10.1016/j.ipl.2008.03.017
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.7439
Introduction to Algorithms, 2001. ,
Programming Models and Development Software for a Space-Based Many-Core Processor, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology, pp.95-102, 2011. ,
DOI : 10.1109/SMC-IT.2011.29
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006. ,
DOI : 10.1016/j.future.2004.11.016
Optimization of multilevel checkpoint model for large scale HPC applications, Proc. IPDPS'14, 2014. ,
Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications, IEEE Transactions on Parallel and Distributed Systems, vol.27, issue.10, 2016. ,
DOI : 10.1109/TPDS.2016.2517639
Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, 2016. ,
DOI : 10.1109/TPDS.2016.2546248
URL : https://hal.archives-ouvertes.fr/hal-01263879
The International Exascale Software Project roadmap, The International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011. ,
DOI : 10.2172/471364
URL : http://www.exascale.org/mediawiki/images/2/20/IESP-roadmap.pdf
The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, The International Journal of High Performance Computing Applications, vol.27, issue.1, pp.309-322, 2009. ,
DOI : 10.1016/S0167-8191(00)00087-9
Performance and reliability trade-offs for the double checkpointing algorithm, International Journal of Networking and Computing, vol.4, issue.1, pp.23-41, 2014. ,
DOI : 10.15803/ijnc.4.1_23
URL : https://hal.archives-ouvertes.fr/hal-01091928
Explicit inverses of Toeplitz and associated matrices, ANZIAM Journal, vol.44, pp.185-215, 2003. ,
DOI : 10.21914/anziamj.v44i0.493
Evaluating the Impact of SDC on the GMRES Iterative Solver, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp.1193-1202, 2014. ,
DOI : 10.1109/IPDPS.2014.123
Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp.615-626 ,
DOI : 10.1109/ICDCS.2012.56
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.228.2542
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002. ,
DOI : 10.1145/568522.568525
Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004. ,
DOI : 10.1109/TDSC.2004.15
The case for modular redundancy in large-scale high performance computing systems, In: PDCN. IASTED, 2009. ,
Redundant Execution of HPC Applications with MR-MPI, Parallel and Distributed Computing and Networks / 720: Software Engineering, 2011. ,
DOI : 10.2316/P.2011.719-031
Razor: circuit-level correction of timing errors for low-power operation, IEEE Micro, vol.24, issue.6, pp.10-20, 2004. ,
DOI : 10.1109/MM.2004.85
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.44:1-44:12, 2011. ,
DOI : 10.1145/2063384.2063443
Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC'12. 2012, p.78 ,
An accurate model for soft error rate estimation considering dynamic voltage and frequency scaling effects, Microelectronics Reliability, vol.51, issue.2, pp.460-467, 2011. ,
DOI : 10.1016/j.microrel.2010.08.016
Stochastic Processes: Theory for Applications, 2014. ,
Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979. ,
Performance-constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters, ACM/IEEE SC 2005 Conference (SC'05), p.34, 2005. ,
DOI : 10.1109/SC.2005.57
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.1233
How to kill a supercomputer: Dirty power, cosmic rays, and bad solder, In: IEEE Spectrum, 2016. ,
ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012. ,
DOI : 10.1016/j.procs.2012.04.018
URL : http://doi.org/10.1016/j.procs.2012.04.018
Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013. ,
DOI : 10.1109/TC.2012.17
Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063444
Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984. ,
Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design, In: SIGARCH Comput. Archit. News, vol.40, issue.1, pp.111-122, 2012. ,
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, 2010. ,
DOI : 10.1109/ICPP.2010.80
Voltage over-scaling: A cross-layer design perspective for energy efficient systems, 2011 20th European Conference on Circuit Theory and Design (ECCTD), pp.548-551 ,
DOI : 10.1109/ECCTD.2011.6043592
Knapsack Problems, 2004. ,
DOI : 10.1007/978-3-540-24777-7
Adaptive voltage over-scaling for resilient applications, 2011 Design, Automation & Test in Europe, pp.1-6, 2011. ,
DOI : 10.1109/DATE.2011.5763153
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, In: 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009. ,
When is multi-version checkpointing needed?, Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, FTXS '13, pp.49-56 ,
DOI : 10.1145/2465813.2465821
Top ten exascale research challenges, In: DOE ASCAC subcommittee report, pp.1-86, 2014. ,
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962. ,
DOI : 10.1147/rd.62.0200
Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer, p.4, 2015. ,
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM/IEEE SC Conf, pp.1-11, 2010. ,
ACR, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '13, 2013. ,
DOI : 10.1145/2503210.2503266
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994. ,
Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), 2007. ,
DOI : 10.1109/MSST.2007.4367962
Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998. ,
DOI : 10.1109/71.730527
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.4662
A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.4, pp.346-362, 2001. ,
DOI : 10.1109/71.920586
Relax-and-Retime: A methodology for energy-efficient recovery based design, Design Automation Conference (DAC). 2013, pp.1-6 ,
The Eckert tapes: Computer pioneer says ENIAC team couldn't afford to fail - and didn't, In: Computerworld, vol.40, issue.8, p.18, 2006. ,
Supporting highly-decoupled thread-level redundancy for parallel programs, 2008 IEEE 14th International Symposium on High Performance Computer Architecture, pp.393-404, 2008. ,
DOI : 10.1109/HPCA.2008.4658655
Multiple Frequency Selection in DVFS-Enabled Processors to Minimize Energy Consumption, 2012. ,
DOI : 10.1109/MM.2005.70
URL : http://arxiv.org/abs/1203.5160
The First Computers: History and Architectures. History of computing, ISBN 9780262681377, 2002. ,
Self-stabilizing iterative solvers, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '13, 2013. ,
DOI : 10.1145/2530268.2530272
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, 2007. ,
DOI : 10.1088/1742-6596/78/1/012022
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.69.2659
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78 ,
DOI : 10.1145/2304576.2304588
Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, pp.198-202, 1998. ,
DOI : 10.1049/ip-sen:19982440
URL : http://estudogeral.sib.uc.pt/jspui/bitstream/10316/12927/1/Using%20two-level%20stable%20storage.pdf
Addressing failures in exascale computing, The International Journal of High Performance Computing Applications, vol.37, issue.13, pp.129-173, 2014. ,
DOI : 10.1016/j.anucene.2010.01.017
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.640.2201
Memory Errors in Modern Systems, ACM SIGPLAN Notices, vol.50, issue.4, pp.297-310 ,
DOI : 10.1145/1815961.1815973
Does partial replication pay off?, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012. ,
DOI : 10.1109/DSNW.2012.6264669
On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, 1984. ,
DOI : 10.1137/0213039
URL : http://ecommons.cornell.edu/bitstream/1813/6386/1/83-546.pdf
A Case for Two-level Distributed Recovery Schemes, In: SIGMETRICS Perform. Eval. Rev, vol.23, issue.1, pp.64-73, 1995. ,
Towards Energy Aware Scheduling for Precedence Constrained Parallel Tasks in a Cluster with DVFS, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010. ,
DOI : 10.1109/CCGRID.2010.19
Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010. ,
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
Thread-level redundancy fault tolerant CMP based on relaxed input replication, In: ICCIT. IEEE, 2011. ,
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI ,
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009. ,
DOI : 10.1109/CLUSTR.2009.5289177
Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart, IEEE Transactions on Computers, vol.64, issue.5, pp.1402-1415, 2015. ,
DOI : 10.1109/TC.2014.2317182
IBM Experiments in Soft Fails in Computer Electronics, In: IBM J. Res. Dev, vol.40, issue.1, pp.3-18, 1996. ,
DOI : 10.1147/rd.401.0003
Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998. ,
DOI : 10.1109/4.658626
Coping with silent errors in HPC applications, In: Emergent Computation, pp.269-292, 2016. ,
Coping with recall and precision of soft error detectors, pp.8-24, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01246639
Towards Optimal Multi-Level Checkpointing, IEEE Transactions on Computers, vol.66, issue.7, 2016. ,
DOI : 10.1109/TC.2016.2643660
URL : https://hal.archives-ouvertes.fr/hal-01339788
Assessing general-purpose algorithms to cope with fail-stop and silent errors, In: ACM Transactions on Parallel Computing, vol.3, issue.2, p.13, 2016. ,
DOI : 10.1007/978-3-319-17248-4_11
URL : https://hal.archives-ouvertes.fr/hal-01066664
Multi-level checkpointing and silent error detection for linear workflows, Journal of Computational Science, 2017. ,
DOI : 10.1016/j.jocs.2017.03.024
URL : https://hal.archives-ouvertes.fr/hal-01363581
Which verification for soft error detection, International Conference on High Performance Computing (HiPC). IEEE. 2015, pp.2-11 ,
DOI : 10.1109/hipc.2015.26
URL : https://hal.archives-ouvertes.fr/hal-01252382
Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.202-211 ,
DOI : 10.1109/IPDPS.2016.39
URL : https://hal.archives-ouvertes.fr/hal-01354886
When Amdahl Meets Young/Daly, 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp.203-212 ,
DOI : 10.1109/CLUSTER.2016.17
URL : https://hal.archives-ouvertes.fr/hal-01355963
Assessing the impact of partial verifications against silent data corruptions, International Conference on Parallel Processing ,
Scheduling Independent Tasks with Voltage Overscaling, Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE. 2015, pp.440-449, 2015. ,
Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale, Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS '17, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01494678
Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms, Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS '17, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01504936
A Different Re-execution Speed Can Help, 2016 45th International Conference on Parallel Processing Workshops (ICPPW) ,
DOI : 10.1109/ICPPW.2016.45
URL : https://hal.archives-ouvertes.fr/hal-01354887
Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors, 7th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2014. ,
DOI : 10.1007/978-3-319-17248-4_11
URL : https://hal.archives-ouvertes.fr/hal-01066664
Two-Level Checkpointing and Verifications for Linear Task Graphs, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.1239-1248 ,
DOI : 10.1109/IPDPSW.2016.106
URL : https://hal.archives-ouvertes.fr/hal-01252400
Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors, Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS '15, pp.27-34 ,
DOI : 10.1109/JETCAS.2011.2135550
URL : https://hal.archives-ouvertes.fr/hal-01121065