We generate the different sets of applications using the following method: let n be the number of unused nodes. At the beginning we set n = N. 1. Draw uniformly at random an integer x between 1 and max(1, n/4096 - 1). These values were based on the applications we previously studied.

Add to the set an application App(k) with parameters w(k) and vol

M. , 152 nodes) on which we run the online algorithms (either maximizing the system efficiency or minimizing the dilation) and PerSched. The results are presented on Figures 9.7a and 9.7b for simulations using the Intrepid settings and Figures 9.8a and 9, vol.49

M. Albrecht, P. Donnelly, P. Bui, and D. Thain, Makeflow: A portable abstraction for data intensive computing on clusters, clouds, and grids, In: 1st ACM SWEET SIGMOD, 2012.

I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher et al., Kepler: an extensible system for design and execution of scientific workflows, Proc. of 16th SSDBM, pp.423-424, 2004.

G. Amdahl, The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, vol.30, pp.483-485, 1967.

A. Workflows, , 2016.

Argonne Leadership Computing Facility, Mira log traces

R. A. Ashraf, S. Hukerikar, and C. Engelmann, Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery, 2018.

I. Assayad, A. Girault, and H. Kalla, A Bi-Criteria Scheduling Heuristic for Distributed Embedded Systems under Reliability and Real-Time Constraints, Dependable Systems and Networks (DSN), 2004.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, In: Concur. and Comp.: Pract. and Exp, vol.23, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

G. Aupy, O. Beaumont, and L. Eyraud-Dubois, What size should your Buffers to Disk be?, In: Proceedings of the 32nd International Parallel and Distributed Processing Symposium (IPDPS'18), 2018.

G. Aupy, O. Beaumont, and L. Eyraud-Dubois, Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention, Proceedings of the 33rd International Parallel and Distributed Processing Symposium, p.19, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02141616

G. Aupy, A. Benoit, H. Casanova, and Y. Robert, Scheduling computational workflows on failure-prone platforms, Int. J. of Networking and Computing, vol.6, pp.2-26, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01075100

G. Aupy, Y. Robert, and F. Vivien, Assuming failure independence: are we right to be wrong?, In: FTS'2017, the Workshop on Fault-Tolerant Systems, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01654639

B. S. Baker, E. G. Coffman, and R. L. Rivest, Orthogonal Packings in Two Dimensions, In: SIAM Journal on Computing, vol.9, pp.846-855, 1980.

P. Balaprakash, L. A. Gomez, M. Bouguerra, S. M. Wild, F. Cappello et al., Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing, Proc. PMBS'14, 2014.

L. Bautista-Gomez and F. Cappello, Detecting Silent Data Corruption Through Data Dynamic Monitoring for Scientific Applications, SIGPLAN Notices, vol.49, pp.381-382, 2014.

L. Bautista-Gomez and F. Cappello, Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation, In: FTS. IEEE, 2015.

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High Performance Fault Tolerance Interface for Hybrid Systems, Proc. SC'11, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216

L. Bautista-Gomez, F. Zyulkyarov, O. Unsal, and S. McIntosh-Smith, Unprotected Computing: A Large-scale Study of DRAM Raw Error Rate on a Supercomputer, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '16. Salt Lake City, vol.55, 2016.

B. Behzad et al., Taming parallel I/O complexity with auto-tuning, p.13, 2013.

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.202-211, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01354886

A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Coping with silent and fail-stop errors at scale by combining replication and checkpointing, In: J. Parallel and Distributed Computing, vol.122, pp.209-225, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02082389

A. Benoit, A. Cavelan, F. Ciorba, V. Le Fèvre, and Y. Robert, Combining checkpointing and replication for reliable execution of linear workflows, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963655

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors, In: ACM Trans. Parallel Comput, vol.3, issue.2, pp.1-13, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, T. Herault, V. Le Fèvre, and Y. Robert, Replication Is More Efficient Than You Think: Code and, 2019.

E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello, Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications, 2015.

S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M. Su et al., Characterization of scientific workflows, Workflows in Support of Large-Scale Science (WORKS), pp.1-10, 2008.

R. Biswas, M. Aftosmis, C. Kiris, and B. Shen, Petascale computing: Impact on future NASA missions, Petascale Computing: Architectures and Algorithms, pp.29-46, 2007.

W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Post-failure recovery of MPI communication capability: Design and rationale, In: International Journal of High Performance Computing Applications, vol.27, pp.244-254, 2013.

W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca et al., An evaluation of User-Level Failure Mitigation support in MPI, In: Computing, vol.95, issue.12, pp.1171-1184, 2013.

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, In: Concurrency and Computation: Practice and Experience, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00696154

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput, vol.69, pp.410-416, 2009.

S. Boyd and L. Vandenberghe, Convex Optimization, ISBN 0521833787, 2004.

T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran et al., A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems, In: Journal of Parallel and Distributed Computing, vol.61, pp.810-837, 2001.

R. Brightwell, K. Ferreira, and R. Riesen, Transparent Redundant Computing with MPI, 2010.

G. H. Bryan and J. M. Fritsch, A benchmark simulation for moist nonhydrostatic numerical models, In: Monthly Weather Review, vol.130, 2002.

G. L. Bryan, Enzo: An Adaptive Mesh Refinement Code for Astrophysics, 2013.

E. S. Buneci, Qualitative Performance Analysis for Large-Scale Scientific Workflows, 2008.

S. Byna, Y. Chen, X. Sun, R. Thakur, and W. Gropp, Parallel I/O prefetching using MPI file caching and I/O signatures, SC '08: Proceedings of the ACM/IEEE Conference on Supercomputing, pp.1-12, 2008.

C. Cao, T. Herault, G. Bosilca, and J. Dongarra, Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: IPDPS. IEEE, pp.765-774, 2015.

F. Cappello and K. Mohror, Very Low Overhead Checkpointing System, 2019.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, In: Int. J. High Performance Computing Applications, vol.23, pp.374-388, 2009.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 update, In: Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

P. Carns, R. Latham, R. Ross, K. Iskra, S. Lang et al., 24/7 characterization of petascale I/O workloads, Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on, pp.1-10, 2009.

J. Carter, J. Borrill, and L. Oliker, Performance characteristics of a cosmology package on leading HPC architectures, pp.176-188, 2005.

H. Casanova, M. Bougeret, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, In: Int. Journal of High Performance Computing Applications, vol.28, pp.210-224, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00668016

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Gen. Comp. Syst, vol.51, pp.7-19, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01199752

K. M. Chandy and L. Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, In: ACM Transactions on Computer Systems, vol.3, pp.63-75, 1985.

B. Chen and A. P. Vestjens, Scheduling on identical machines: How good is LPT in an on-line setting, Operations Research Letters, vol.21, pp.165-169, 1997.

C. Chen, G. Eisenhauer, M. Wolf, and S. Pande, LADR: Low-cost Application-level Detector for Reducing Silent Output Corruptions, In: HPDC. Tempe, Arizona, pp.156-167, 2018.

Z. Chen, Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods, In: SIGPLAN Not, vol.48, pp.167-176, 2013.

J. Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker et al., Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines, In: Scientific Programming, vol.5, pp.173-184, 1996.

W. Cirne and F. Berman, Using Moldability to Improve the Performance of Supercomputer Jobs, J. Parallel Distrib. Comput, vol.62, pp.1571-1601, 2002.

E. G. Coffman, M. R. Garey, D. S. Johnson, and R. E. Tarjan, Performance Bounds for Level-Oriented Two-Dimensional Packing Algorithms, In: SIAM J. Comput, vol.9, pp.808-826, 1980.

P. Colella, Chombo infrastructure for adaptive mesh refinement, 2005.

Argonne and Livermore National Laboratories, DRAFT CORAL-2 BUILD STATEMENT OF WORK, 2018.

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming Models and Development Software for a Space-Based Many-Core Processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011.

V. Cuevas-Vicenttín, S. C. Dey, S. Köhler, S. Riddle, and B. Ludäscher, Scientific Workflows and Provenance: Introduction and Research Opportunities, In: Datenbank-Spektrum, vol.12, issue.3, pp.193-203, 2012.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.

A. Darte, Y. Robert, and F. Vivien, Scheduling and automatic parallelization, Birkhäuser, pp.978-981, 2000.
URL : https://hal.archives-ouvertes.fr/hal-00856645

E. Deelman, G. Singh, M. Su, J. Blythe, Y. Gil et al., Pegasus: A framework for mapping complex scientific workflows onto distributed systems, In: Scientific Programming, vol.13, pp.219-237, 2005.

E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan et al., Pegasus, a Workflow Management System for Science Automation, Future Generation Computer Systems, vol.46, pp.17-35, 2015.

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of multi-level checkpoint model for large scale HPC applications, Proc. IPDPS'14, 2014.

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, In: IEEE Trans. Parallel & Distributed Systems, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01353871

M. E. Diouri, O. Glück, L. Lefevre, and F. Cappello, Energy considerations in checkpointing and fault tolerance protocols, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00748006

M. Dorier, G. Antoniu, R. Ross, D. Kimpe, and S. Ibrahim, CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination, IPDPS'14, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00916091

M. Dorier, S. Ibrahim, G. Antoniu, and R. Ross, Omnisc'IO: a grammar-based approach to spatial and temporal I/O patterns prediction, pp.623-634, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01025670

A. B. Downey, The structural cause of file size distributions, pp.361-370, 2001.

M. Drozdowski, Scheduling for Parallel Processing, 2009.

P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, Algorithm-based Fault Tolerance for Dense Matrix Factorizations, In: PPoPP. ACM, pp.225-234, 2012.

P. Du, P. Luszczek, S. Tomov, and J. Dongarra, Soft error resilient QR factorization for hybrid system with GPGPU, Scalable Algorithms for Large-Scale Systems Workshop (ScalA2011), vol.4, pp.1877-7503, 2011.

P. Dutot, G. Mounié, and D. Trystram, In: Handbook of Scheduling - Algorithms, Models, and Performance Analysis, 2004.

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining partial redundancy and checkpointing for HPC, 2012.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, In: ACM Computing Surveys, vol.34, pp.375-408, 2002.

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, In: IEEE Trans. Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, 2009.

C. Engelmann and S. Böhm, Redundant execution of HPC applications with MR-MPI, In: PDCN. IASTED, 2011.

S. Ethier, M. Adams, J. Carter, and L. Oliker, Petascale parallelization of the gyrokinetic toroidal code, VECPAR, 2012.

T. Fahringer, R. Prodan, R. Duan, J. Hofer, F. Nadeem et al., Askalon: A development and grid computing environment for scientific workflows, pp.450-471, 2007.

A. Fang, H. Fujita, and A. A. Chien, Towards Understanding Post-recovery Efficiency for Shrinking and Non-shrinking Recovery, Euro-Par 2015: Parallel Processing Workshops, pp.656-668, 2015.

D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, Theory and Practice in Parallel Job Scheduling, pp.1-34, 1997.

A. Feldmann, J. Sgall, and S. Teng, Dynamic scheduling on parallel machines, In: Theoretical Computer Science, vol.130, pp.49-72, 1994.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the Viability of Process Replication Reliability for Exascale Systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, vol.44, p.12, 2011.

V. Le Fèvre,

P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger, On Ramanujan's Q-Function, In: J. Computational and Applied Mathematics, vol.58, pp.103-116, 1995.

A. Gainaru, G. Aupy, A. Benoit, F. Cappello, Y. Robert et al., Scheduling the I/O of HPC applications under congestion, IPDPS. IEEE, pp.1013-1022, 2015.
URL : https://hal.archives-ouvertes.fr/hal-00983789

R. G. Gallager, Information theory and reliable communication, vol.2, 1968.

R. G. Gallager, Stochastic Processes: Theory for Applications, 2014.

M. R. Garey and R. L. Graham, Bounds for multiprocessor scheduling with resource constraints, In: SIAM J. Comput, vol.4, pp.187-200, 1975.

M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979.

E. Gaussier, J. Lelong, V. Reis, and D. Trystram, Online Tuning of EASY-Backfilling using Queue Reordering Policies, IEEE Transactions on Parallel and Distributed Systems, vol.29, pp.2304-2316, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963216

C. George and S. S. Vadhiyar, ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, In: Procedia Computer Science, vol.9, pp.166-175, 2012.

R. Guerraoui and A. Schiper, Fault-tolerance by replication in distributed systems, Reliable Software Technologies - Ada-Europe '96, pp.38-57, 1996.

P. Guhur, H. Zhang, T. Peterka, E. Constantinescu, and F. Cappello, Lightweight and Accurate Silent Data Corruption Detection in Ordinary Differential Equation Solvers, 2016.

Y. Guo, W. Bland, P. Balaji, and X. Zhou, Fault tolerant MapReduce-MPI for HPC clusters, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, vol.34, p.12, 2015.

S. Gupta, T. Patel, C. Engelmann, and D. Tiwari, Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '17, vol.44, 2017.

S. Habib, The universe at extreme scale: multi-petaflop sky simulation on the BG/Q, p.4, 2012.

D. Hakkarinen and Z. Chen, Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, pp.772-783, 2013.

D. Hakkarinen, P. Wu, and Z. Chen, Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition, IEEE Transactions on Parallel and Distributed Systems, vol.26, ISSN 1045-9219, 2015.

L. Han, L. Canon, H. Casanova, Y. Robert, and F. Vivien, Checkpointing workflows for fail-stop errors, IEEE Transactions on Computers, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01559967

X. Han, K. Iwama, D. Ye, and G. Zhang, Strip Packing vs. Bin Packing, Algorithmic Aspects in Information and Management, pp.358-367, 2007.

C. Hanen and A. Munier, Cyclic scheduling on parallel processors: an overview, 1993.

B. Harrod, Big data and scientific discovery, 2014.

J. He, J. Bent, A. Torres, G. Grider, G. Gibson et al., I/O Acceleration with Pattern Detection, Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing. HPDC '13, pp.25-36, 2013.

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proc. ACM/IEEE Supercomputing'11, 2011.

T. Herault, Y. Robert, A. Bouteiller, D. Arnold, K. Ferreira et al., Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms, APDCM 2018, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01621295

T. Herault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

A. Hori, K. Yoshinaga, T. Herault, A. Bouteiller, G. Bosilca et al., Sliding Substitution of Failed Nodes, Proceedings of the 22nd European MPI Users' Group Meeting. EuroMPI '15, vol.14, pp.1-14, 2015.

W. Hu, G. Liu, Q. Li, Y. Jiang, and G. Cai, Storage wall for exascale supercomputing, In: Journal of Zhejiang University-SCIENCE, vol.2016, pp.10-25, 2016.

K. Huang and J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Comput, vol.33, pp.518-528, 1984.

Z. Hussain, T. Znati, and R. Melhem, Partial Redundancy in HPC Systems with Non-uniform Node Reliabilities, SC '18, 2018.

F. Isaila and J. Carretero, Making the case for data staging coordination and control for parallel applications, Workshop on Exascale MPI at Supercomputing Conference, 2015.

F. Isaila, J. Carretero, and R. Ross, Clarisse: A middleware for data-staging coordination and control on large-scale hpc platforms, Cluster, Cloud and Grid Computing (CCGrid), pp.346-355, 2016.

D. B. Jackson, Q. Snell, and M. J. Clement, Core Algorithms of the Maui Scheduler, pp.87-102, 2001.

K. Jansen, A (3/2+ε) Approximation Algorithm for Scheduling Moldable and Non-moldable Parallel Tasks, SPAA, pp.224-235, 2012.

H. Jin, X. Sun, Z. Zheng, Z. Lan, and B. Xie, Performance Under Failures of DAG-based Parallel Computing, CCGRID '09, 2009.

B. Johannes, Scheduling Parallel Jobs to Minimize the Makespan, In: J. of Scheduling, vol.9, issue.5, pp.433-452, 2006.

G. Juve, A. Chervenak, E. Deelman, S. Bharathi, G. Mehta et al., Characterizing and profiling scientific workflows, Future Generation Computer Systems, vol.29, pp.682-692, 2013.

E. Kail, P. Kacsuk, and M. Kozlovszky, A novel adaptive checkpointing method based on information obtained from workflow structure, Computer Science, vol.17, issue.3, 2016.

G. Kandaswamy, A. Mandal, and D. A. Reed, Fault Tolerance and Recovery of Scientific Workflows on Computational Grids, Proceedings of the Eighth IEEE International Symposium on Cluster Computing and the Grid. CCGRID '08, pp.777-782, 2008.

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, Cluster Computing and the Grid, IEEE International Symposium on, pp.398-407, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788866

A. Kougkas, M. Dorier, R. Latham, R. Ross, and X. Sun, Leveraging Burst Buffer Coordination to Prevent I/O Interference, In: IEEE International Conference on eScience. IEEE, 2016.

S. Kumar, Characterization and modeling of PIDX parallel I/O for performance optimization, In: SC. ACM, 2013.

S. Kumar, Fundamental limits to Moore's law, 2015.

Los Alamos National Lab, The Trinity project

LANL, Computer Failure Data Repository

A. Lazzarini, Advanced LIGO Data & Computing, 2003.

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

Y. Li and L. B. Kish, Heat, Speed and Error Limits of Moore's Law at the Nano Scales, In: Fluctuation and Noise Letters, pp.127-131, 2006.

D. A. Lifka, The ANL/IBM SP Scheduling System, pp.295-303, 1995.

N. Liu, On the Role of Burst Buffers in Leadership-Class Storage Systems

G. K. Lockwood, S. Snyder, T. Wang, S. Byna, P. Carns et al., A year in the life of a parallel file system, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p.74, 2018.

A. Lodi, S. Martello, and M. Monaci, Two-dimensional packing problems: A survey, European Journal of Operational Research, vol.141, pp.241-252, 2002.

J. Lofstead, Managing variability in the IO performance of petascale storage systems, In: SC. IEEECS, 2010.

J. Lofstead and R. Ross, Insights for exascale IO APIs from building a petascale IO API, In: SC13. ACM, p.87, 2013.

R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson et al., Top ten exascale research challenges, DOE ASCAC subcommittee report, pp.1-86, 2014.

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, In: IBM J. Res. Dev, vol.6, pp.200-209, 1962.

D. P. Mehta, C. Shetters, and D. W. Bouldin, Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs, 2013.

P. Mendygral, N. Radcliffe, K. Kandalla, D. Porter, B. O'Neill et al., WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code, The Astrophysical Journal Supplement Series, vol.228, p.23, 2017.

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM/IEEE SC Conf, pp.1-11, 2010.

J. E. Moreira and V. K. Naik, Dynamic resource management on distributed systems using reconfigurable applications, In: IBM Journal of Research and Development, vol.41, pp.303-330, 1997.

A. W. Mu'alem and D. G. Feitelson, Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, IEEE Trans. Parallel Distrib. Syst, vol.12, pp.529-543, 2001.

R. Nair and H. Tufo, Petascale atmospheric general circulation models, In: Journal of Physics: Conference Series, vol.78, p.12078, 2007.

E. Naroska and U. Schwiegelshohn, On an On-line Scheduling Problem for Parallel Jobs, Inf. Process. Lett, vol.81, pp.297-304, 2002.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proc. SC'13, 2013.

X. Ni, E. Meneses, and L. V. Kalé, Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm, Cluster Computing (CLUSTER), 2012 IEEE International Conference on, pp.364-372, 2012.

T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, pp.553-557, 1994.

R. Oldfield, S. Arunagiri, P. Teller, S. Seelam, M. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, Proc. of IEEE MSST, pp.30-46, 2007.

Pegasus, Pegasus Workflow Generator

A. Petitet, H. Casanova, J. Dongarra, Y. Robert, and R. C. Whaley, Parallel and Distributed Scientific Computing: A Numerical Linear Algebra Problem Solving Environment Designer's Perspective, Handbook on Parallel and Distributed Processing, 1999.

J. Plank, K. Li, and M. Puening, Diskless checkpointing, In: IEEE Trans. Parallel Dist. Systems, vol.9, ISSN 1045-9219, 1998.

A. Pothen and C. Sun, A mapping algorithm for parallel sparse Cholesky factorization, In: SIAM J. on Scientific Computing, vol.14, pp.1253-1257, 1993.

S. Prabhakaran, Dynamic Resource Management and Job Scheduling for High Performance Computing, 2016.

S. Prabhakaran, M. Neumann, and F. Wolf, Efficient Fault Tolerance Through Dynamic Node Replacement, 18th Int. Symp. on Cluster, Cloud and Grid Computing CCGRID, pp.163-172, 2018.

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, 14th Int. Conf. on High-Performance Computer Architecture (HPCA), pp.393-404, 2008.

R. Riesen, K. Ferreira, and J. Stearley, See applications run and throughput jump: The case for redundant computing in HPC, Proc. of the Dependable Systems and Networks Workshops, pp.29-34, 2010.

R. Sankaran et al., Direct numerical simulations of turbulent lean premixed combustion, In: Journal of Physics: Conference Series, vol.46, p.38, 2006.

N. El-sayed and B. Schroeder, Reading between the lines of failure logs: Understanding how HPC systems fail, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp.1-12, 2013.

B. Schroeder and G. A. Gibson, Understanding Failures in Petascale Computers, In: Journal of Physics: Conference Series, vol.78, 2007.

S. R. Seelam and P. J. Teller, Virtual I/O Scheduler: A Scheduler of Schedulers for Performance Virtualization, Proceedings VEE. San Diego, California, pp.105-115, 2007.

H. Shan and J. Shalf, Using IOR to Analyze the I/O Performance for HPC Platforms, 2007.

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault Tolerant Preconditioned Conjugate Gradient for Sparse Linear System Solution, In: ICS. ACM, 2012.

D. B. Shmoys, J. Wein, and D. P. Williamson, Scheduling Parallel Machines On-line, In: SIAM J. Comput, vol.24, pp.1313-1331, 1995.

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, pp.198-202, 1998.

R. F. Silva, W. Chen, G. Juve, K. Vahi, and E. Deelman, Community resources for enabling research in distributed scientific workflows, 2014 IEEE 10th International Conference on e-Science, vol.1, pp.177-184, 2014.

Simulation Software. Computing the yield, 2018.

D. Skinner and W. Kramer, Understanding the Causes of Performance Variability in HPC Workloads, IEEE Workload Characterization Symposium, pp.137-149, 2005.

J. Skovira, W. Chan, H. Zhou, and D. A. Lifka, The EASY - LoadLeveler API Project, pp.41-47, 1996.

M. Snir, Addressing Failures in Exascale Computing, In: Int. J. High Perform. Comput. Appl, vol.28, pp.129-173, 2014.

S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, Characterization of backfilling strategies for parallel job scheduling, International Conference on Parallel Processing Workshop, 2002.

G. Staples, TORQUE Resource Manager, Proceedings of the ACM/IEEE Conference on Supercomputing, 2006.

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, In: FTXS. IEEE, 2012.

O. Subasi, G. Yalcin, F. Zyulkyarov, O. Unsal, and J. Labarta, Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, Proc. CCGrid'2017, pp.452-457, 2017.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. Cristal, Programmer-directed Partial Redundancy for Resilient HPC, In: Computing Frontiers, 2015.

R. Sudarsan and C. J. Ribbens, Design and performance of a scheduling framework for resizable parallel applications, Parallel Computing, vol.36, issue.1, pp.48-64, 2010.

R. Sudarsan, C. J. Ribbens, and D. Farkas, Dynamic Resizing of Parallel Scientific Simulations: A Case Study Using LAMMPS, Int. Conf. Computational Science ICCS. Procedia, pp.175-184, 2009.

D. Talia, Workflow Systems for Science: Concepts and Tools, In: ISRN Software Engineering, 2013.

K. Tang, P. Huang, X. He, T. Lu, S. S. Vazhkudai et al., Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior, pp.87-98, 2017.

F. Tessier, P. Malakar, V. Vishwanath, E. Jeannot, and F. Isaila, Topology-aware data aggregation for intensive I/O on large-scale supercomputers, First Workshop on Optimization of Communication in HPC, pp.73-81, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01394741

T. Tobita and H. Kasahara, A standard task graph set for fair evaluation of multiprocessor scheduling algorithms, In: Journal of Scheduling, vol.5, pp.379-394, 2002.

Top500, Top 500 Supercomputer Sites, 2019.

H. Topcuoglu, S. Hariri, and M. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, pp.260-274, 2002.

S. Toueg and Ö. Babaoglu, On the Optimum Checkpoint Selection Problem, In: SIAM J. Comput, vol.13, pp.630-649, 1984.

J. Turek, J. L. Wolf, and P. S. Yu, Approximate Algorithms Scheduling Parallelizable Tasks, In: SPAA, 1992.

N. H. Vaidya, A Case for Two-level Distributed Recovery Schemes, SIGMETRICS Perform. Eval. Rev, vol.23, pp.64-73, 1995.

J. Valdes, R. E. Tarjan, and E. L. Lawler, The Recognition of Series Parallel Digraphs, Proc. of STOC'79, pp.1-12, 1979.

C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, Proactive process-level live migration in HPC environments, SC '08: Proc. ACM/IEEE Conference on Supercomputing, 2008.

P. Wang, K. Zhang, R. Chen, H. Chen, and H. Guan, Replication-Based Fault-Tolerance for Large-Scale Graph Processing, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.562-573, 2014.

E. Weisstein, Gauss hypergeometric function. From MathWorld-A Wolfram Web Resource

E. Weisstein, Incomplete Beta Function. From MathWorld-A Wolfram Web Resource

M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz et al., Swift: A language for distributed parallel scripting, In: Parallel Computing, vol.37, pp.633-652, 2011.

K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers et al., The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud, In: Nucleic acids research, p.328, 2013.

A. K. Wong and A. M. Goscinski, Evaluating the EASY-backfill Job Scheduling of Static Workloads on Clusters, 2007.

M. Y. Wu and D. D. Gajski, Hypertool: a programming aid for message-passing systems, IEEE Trans. Parallel Distributed Systems, vol.1, pp.330-343, 1990.

P. Wu, C. Ding, L. Chen, F. Gao, T. Davies et al., Fault Tolerant Matrix-matrix Multiplication: Correcting Soft Errors On-line, ScalA'11, pp.25-28, 2011.

K. Yamamoto, A. Uno, H. Murai, T. Tsukamoto, F. Shoji et al., The K computer Operations: Experiences and Statistics, In: Procedia Computer Science (ICCS), vol.29, pp.576-585, 2014.

E. Yao, J. Zhang, M. Chen, G. Tan, and N. Sun, Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance, In: International Journal of High Performance Computing Applications, vol.29, pp.422-436, 2015.

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using Replication and Checkpointing for Reliable Task Management in Computational Grids, SC'10, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867

A. B. Yoo, M. A. Jette, and M. Grondona, SLURM: Simple Linux Utility for Resource Management, pp.44-60, 2003.

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, pp.530-531, 1974.

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011.

F. Zhang, C. Docan, M. Parashar, S. Klasky, N. Podhorszki et al., Enabling In-situ Execution of Coupled Scientific Workflow on Multi-core Platform, Proc. 26th IEEE IPDPS, pp.1352-1363, 2012.

X. Zhang, K. Davis, and S. Jiang, Opportunistic data-driven execution of parallel programs for efficient I/O services, In: IPDPS'12, pp.330-341, 2012.

G. Zheng, L. Shi, and L. V. Kale, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE Computer Society, pp.93-103, 2004.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.

Z. Zheng, L. Yu, and Z. Lan, Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart, IEEE Trans. Computers, vol.64, pp.1402-1415, 2015.

Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang et al., I/O-Aware Batch Scheduling for Petascale Computing Systems, pp.254-263, 2015.

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM Experiments in Soft Fails in Computer Electronics, In: IBM J. Res. Dev, vol.40, pp.3-18, 1996.

J. Ziegler, H. Muhlfeld, C. Montrose, H. Curtis, T. O'Gorman et al., Accelerated testing for cosmic soft-error rate, In: IBM J. Res. Dev, vol.40, pp.51-72, 1996.

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, In: IEEE Journal of Solid-State Circuits, vol.33, pp.246-252, 1998.

Articles in International Refereed Journals

A. Benoit, A. Cavelan, V. Le-fèvre, Y. Robert, and H. Sun, Towards Optimal Multi-Level Checkpointing, IEEE Transactions on Computers, 2016.
URL : https://hal.archives-ouvertes.fr/hal-02082416

G. Aupy, A. Gainaru, and V. Le-fèvre, I/O Scheduling Strategy for Periodic Applications, In: ACM Transactions on Parallel Computing, vol.6, issue.2, ISSN 2329-4949, 2019.

A. Benoit, A. Cavelan, F. M. Ciorba, V. Le-fèvre, and Y. Robert, Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors, International Journal of Networking and Computing, vol.9, ISSN 2185-2847, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02082369

L. Han, V. Le-fèvre, L. Canon, Y. Robert, and F. Vivien, A generic approach to scheduling and checkpointing workflows, The International Journal of High Performance Computing Applications, vol.33, pp.1255-1274, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02140295

V. Le-fèvre, T. Herault, Y. Robert, A. Bouteiller, A. Hori et al., Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms, In: Parallel Computing, vol.85, ISSN 0167-8191, 2019.

Articles in International Refereed Conferences

L. Han, V. Le-fèvre, L. Canon, Y. Robert, and F. Vivien, A Generic Approach to Scheduling and Checkpointing Workflows, Proceedings of the 47th International Conference on Parallel Processing, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02140295

A. Benoit, T. Herault, V. Le-fèvre, and Y. Robert, Replication is More Efficient than You Think, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '19. Denver, Colorado, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02265925

Articles in International Refereed Workshops

A. Benoit, A. Cavelan, V. Le-fèvre, Y. Robert, and H. Sun, A Different Reexecution Speed Can Help, 2016 45th International Conference on Parallel Processing Workshops (ICPPW), pp.250-257, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01354887

G. Aupy, A. Gainaru, and V. Le-fèvre, In: S. Jarvis, S. Wright et al. (eds.), High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, pp.978-981, 2017.

A. Benoit, A. Cavelan, V. Le-fèvre, and Y. Robert, Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms, Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale. FTXS '17. Washington, pp.9-16, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082847

A. Benoit, A. Cavelan, F. M. Ciorba, V. Le-fèvre, and Y. Robert, Combining Checkpointing and Replication for Reliable Execution of Linear Workflows, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.793-802, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963655

V. Le-fèvre, L. Bautista-Gomez, O. Unsal, and M. Casas, IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp.97-107, 2018.

V. Le-fèvre, G. Bosilca, A. Bouteiller, T. Herault, A. Hori et al., Do Moldable Applications Perform Better on Failure-Prone HPC Platforms?, pp.787-799, 2018.

A. Benoit, V. Le-fèvre, P. Raghavan, Y. Robert, and H. Sun, Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs, 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2020.
URL : https://hal.archives-ouvertes.fr/hal-02317464