CRIU: Checkpoint/Restore in Userspace

J. Weber, Sending Data from Other Sources Using GA's Measurement Protocol, Practical Google Analytics and Google Tag Manager for Developers, pp.231-236, 2015.

, Overview of BIOVIA Materials Studio, LAMMPS, and GROMACS, Molecular Dynamics Simulation of Nanocomposites Using BIOVIA Materials Studio, Lammps and Gromacs, pp.39-100, 2019.

, Figure 11.4. The 500 most powerful non-distributed computer systems, by location, July 2020, pp.2020-2021

, Whitehead, Graham Wright, (died 30 June 2015), President, Jaguar Cars Inc., 1983–90; Chairman, Jaguar Canada Inc., Ontario, 1983–90; Director: Jaguar Cars Ltd, 1982–90; Jaguar plc, 1984–90, Specifications of Jaguar supercomputer, ranked first at Top500 in, pp.2020-2027, 2007.

H. M. Morgan, R. T. Mills, and B. Smith, Evaluation of PETSc on a Heterogeneous Architecture, the OLCF Summit System: Part 1: Vector Node Performance, Specifications of Summit supercomputer, ranked first at Top500 in, pp.2020-2027, 2020.

A. Hasan, E. Greg, W. Matthew, S. Karsten, and K. Scott, Just in Time: Adding Value to the IO Pipelines of High Performance Applications with JITStaging. In International symposium on High performance distributed computing, pp.27-36, 2011.

A. Hasan, W. Matthew, E. Greg, K. Scott, K. Schwan et al., DataStager: Scalable Data Staging Services for Petascale Applications, 18th ACM international symposium on High performance distributed computing, pp.39-48, 2009.

A. Bilge, Mitigating Variability in HPC Systems and Applications for Performance and Power Efficiency, 2017.

A. Maxim and M. Haim, Service provider competition: Delay cost structure, segmentation, and cost advantage, Manufacturing & Service Operations Management, vol.12, issue.2, pp.213-235, 2010.

A. Sean, S. Arie, M. Kwan-liu, C. Alok, C. Terence et al., Scientific discovery at the exascale: Report from the DOE ASCR 2011 Workshop on Exascale Data Management, Analysis, and Visualization, vol.2, 2011.

S. Ifrah, Getting Started with Containers on Amazon AWS, Deploy Containers on AWS, pp.1-40, 2019.

G. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the April 18-20, 1967, spring joint computer conference on - AFIPS '67 (Spring), pp.483-485, 1967.

A. Dan, H. William, Y. Huichen, and O. Adedolapo, Machine Learning for Predictive Analytics of Compute Cluster Jobs. CoRR, abs/1806.01116, 2018.

A. Jason, A. Kapil, and C. Gene, DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop, 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS'09), pp.1-12, 2009.

A. Cédric, T. Samuel, N. Raymond, and W. Pierre-andré, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, vol.23, pp.187-198, 2011.

A. Guillaume, B. Olivier, and E. Lionel, What Size Should your Buffers to Disks be? In International Parallel and Distributed Processing Symposium (IPDPS), 2018.

A. Guillaume, B. Olivier, and E. Lionel, Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention, IPDPS 2019 -33rd IEEE International Parallel and Distributed Processing Symposium, 2019.

A. Guillaume, A. Benoit, H. Thomas, R. Yves, V. Frédéric et al., On the Combination of Silent Error Detection and Checkpointing, IEEE 19th Pacific Rim International Symposium on Dependable Computing, PRDC 2013, pp.11-20, 2013.

A. Guillaume, R. Yves, V. Frédéric, and Z. Dounia, Checkpointing algorithms and fault prediction, J. Parallel Distrib. Comput, vol.74, issue.2, pp.2048-2064, 2014.

A. Utkarsh, W. Brad, W. Matthew, L. Burlen, G. Berk et al., The SENSEI Generic in Situ Interface, Proceedings of the 2nd Workshop on In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization, ISAV '16, pp.40-44, 2016.

B. Michael, T. Sean, E. Slaughter, and A. Aiken, Legion: Expressing Locality and Independence with Logical Regions, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, vol.66, pp.1-66, 2012.

L. Bautista-gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, pp.1-12, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216

P. Bazin, J. L. Cuzzocreo, M. A. Yassa, W. Gandler, M. J. Mcauliffe et al., Volumetric neuroimage analysis extensions for the MIPAV software package, Journal of Neuroscience Methods, vol.165, issue.1, pp.111-121, 2007.

B. Tal and H. Torsten, Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. CoRR, 2018.

B. Janine, A. Hasan, B. Peer-timo, G. Ray et al., Combining in-situ and in-transit processing to enable extreme-scale scientific analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.49, 2012.

B. John, S. Jerome, O. Guillaume, G. David, and P. Jean-guillaume, Parallel Computational Steering and Analysis for HPC Applications using a ParaView Interface and the HDF5 DSM Virtual File Driver, in Torsten Kuhlen, Renato Pajarola, and Kun Zhou, editors, Eurographics Symposium on Parallel Graphics and Visualization, pp.91-100, 2011.

J. Breitbart, S. Pickartz, S. Lankes, J. Weidendorfer, and A. Monti, Dynamic Co-Scheduling Driven by Main Memory Bandwidth Utilization, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.400-409, 2017.

B. François, C. Jérôme, M. Stéphanie, F. Nathalie, G. Brice et al., hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications, in IEEE, editor, PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, 2010.

J. Bruno, P. Downey, and G. N. Frederickson, Sequencing Tasks with Exponential Service Times to Minimize the Expected Flow Time or Makespan, Journal of the ACM, vol.28, issue.1, pp.100-113, 1981.

C. Louis-claude, K. Aurélie, R. Yves, and V. Frédéric, Scheduling independent stochastic tasks under deadline and budget constraints, Research Report, vol.9178, 2018.

C. Louis-claude and J. Emmanuel, Evaluation and optimization of the robustness of dag schedules in heterogeneous environments, IEEE Transactions on Parallel and Distributed Systems, vol.21, issue.4, pp.532-546, 2010.

C. Nicolas, D. Georges, G. Yiannis, H. Guillaume et al., A batch scheduler with high level components, Cluster computing and Grid 2005 (CC-Grid05), 2005.

C. Julien, M. Sébastien, and L. Jacques-bernard, PaDaWAn: A Python Infrastructure for Loosely Coupled in Situ Workflows, Proceedings of the Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, ISAV '18, pp.7-12, 2018.

C. Shi, L. Hau, and M. Kamran, Pricing Schemes in Cloud Computing: Utilization-Based versus Reservation-Based. Production and Operations Management, 2017.

C. Yang, Checkpoint and Restore of Micro-service in Docker Containers, 3rd International Conference on Mechatronics and Industrial Informatics, pp.915-918, 2015.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.

D. Jeffrey and G. Sanjay, MapReduce: Simplified Data Processing on Large Clusters, Commun. ACM, vol.51, issue.1, pp.107-113, 2008.

D. Ewa, S. Gurmeet, M. Su, B. James, G. Yolanda et al., Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems, Sci. Program, vol.13, issue.3, pp.219-237, 2005.

D. Ewa, V. Karan, J. Gideon, R. Mats, C. Scott et al., Pegasus: a Workflow Management System for Science Automation, Future Generation Computer Systems, vol.46, pp.17-35, 2015.

D. Ludwig and S. Sven, Cloud pricing: the spot market strikes back, The Workshop on Economics of Cloud Computing, 2016.

D. Estelle, C. Laurent, and R. Bruno, TINS: A Task-Based Dynamic Helper Core Strategy for In Situ Analytics, SCA18 -Supercomputing Frontiers Asia, 2018.

D. Ciprian, P. Manish, and K. Scott, DART: a substrate for high speed asynchronous data IO, 17th international symposium on High performance distributed computing, pp.219-220, 2008.

D. Ciprian, P. Manish, and K. Scott, DataSpaces: an Interaction and Coordination Framework for Coupled Simulation Workflows. Cluster Computing, vol.15, pp.163-181, 2012.

F. Dong, J. Luo, A. Song, and J. Jin, Resource Load Based Stochastic DAGs Scheduling Mechanism for Grid Environment, IEEE 12th International Conference on High Performance Computing and Communications (HPCC), pp.197-204, 2010.

D. Matthieu, A. Gabriel, C. Franck, S. Marc, and O. Leigh, Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O, 2012.

D. Matthieu, A. Gabriel, R. Robert, K. Dries, and I. Shadi, CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination, IPDPS -International Parallel and Distributed Processing Symposium, 2014.

D. Matthieu, S. Roberto, P. Tom, A. Gabriel, and S. Dave, Adaptable and User-Friendly In Situ Visualization Framework, IEEE Symposium on Large Data Analysis and Visualization (LDAV), 2013.

M. Dreher and B. Raffin, A Flexible Framework for Asynchronous in Situ and in Transit Analytics for Scientific Simulations, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.277-286, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00941413

D. Matthieu and P. Tom, Bredala: Semantic Data Redistribution for In Situ Applications, Proceedings of IEEE Cluster, 2016.

E. Ifeanyi, L. David, S. Bran, and C. Shiping, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing, vol.65, issue.3, pp.1302-1326, 2013.

N. Fabian, K. Moreland, D. Thompson, A. C. Bauer, P. Marion et al., The ParaView Coprocessing Library: A scalable, general purpose in situ visualization library, 2011 IEEE Symposium on Large Data Analysis and Visualization, pp.89-96, 2011.

D. Feitelson, Workload Data, Workload Modeling for Computer Systems Performance Evaluation, pp.22-72

L. Friedman and G. H. Glover, Report on a multicenter fMRI quality assurance protocol, Journal of Magnetic Resonance Imaging, vol.23, issue.6, pp.827-839, 2006.

G. Ana and P. Guillaume, Making Speculative Scheduling Robust to Incomplete Data, ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2019.

G. Ana, P. Guillaume, S. Hongyang, and R. Padma, ICPP 2019 -48th International Conference on Parallel Processing, 2019.

G. Ana, S. Hongyang, A. Guillaume, H. Yuankai, A. Bennett et al., On-the-fly scheduling vs. reservation-based scheduling for unpredictable workflows, International Journal of High Performance Computing Applications, 2019.

G. Bruno and J. Vincent, Comparisons of Stochastic Task-Resource Systems. In Introduction to Scheduling, 2009.

E. Gaussier, J. Lelong, V. Reis, and D. Trystram, Online Tuning of EASY-Backfilling using Queue Reordering Policies, IEEE Transactions on Parallel and Distributed Systems, vol.29, issue.10, pp.2304-2316, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963216

A. Goel and P. Indyk, Stochastic load balancing and related problems, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039), pp.579-586

M. Mucchetti, Serverless Functions with GCP, BigQuery for Data Warehousing, pp.231-251, 2020.

G. Jian, N. Akihiro, B. Ryan, Z. Haoyu, and M. Satoshi, Machine Learning Predictions for Underestimation of Job Runtime on HPC System, pp.179-198, 2018.

A. Gupta, P. Faraboschi, F. Gioachin, L. V. Kale, R. Kaufmann et al., Evaluating and Improving the Performance and Scheduling of HPC Applications in Cloud, IEEE Transactions on Cloud Computing, vol.4, issue.3, pp.307-321, 2016.

J. L. Gustafson, Reevaluating Amdahl's law, Communications of the ACM, vol.31, issue.5, pp.532-533, 1988.

H. Salman, A. P. Hal, F. Nicholas, F. Katrin, H. David et al., HACC: Simulating sky surveys on state-of-the-art supercomputing architectures, New Astronomy, vol.42, pp.49-65, 2016.

H. Paul and D. Jason, Berkeley lab checkpoint/restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, vol.46, 2006.

R. L. Harrigan, B. C. Yvernault, B. D. Boyd, S. M. Damon, K. D. Gibney et al., Vanderbilt University Institute of Imaging Science Center for Computational Imaging XNAT: A multimodal data archive and processing environment, NeuroImage, vol.124, pp.1097-1101, 2016.

H. Tim, M. Martin, and M. Virendra, Callisto: Co-scheduling Parallel Runtime Systems, Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14), vol.24, pp.1-24, 2014.

H. James, J. Swaroop, G. Andrew, C. Yaroslav, H. Bryan et al., A Common, High-Dimensional Model of the Representational Space in Human Ventral Temporal Cortex, vol.72, pp.404-420, 2011.

A. Heirich, E. Slaughter, M. Papadakis, W. Lee, T. Biedert et al., In situ visualization with task-based parallelism, Proceedings of the In Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization - ISAV'17, p.17, 2017.

M. Herlihy and S. Nir, The art of multiprocessor programming, 2011.

H. Benjamin, K. Andy, Z. Matei, A. Ghodsi, A. D. Joseph et al., Mesos: A Platform for Fine-grained Resource Sharing in the Data Center, 8th USENIX Conf. Networked Systems Design and Implementation, pp.295-308, 2011.

H. Reazul, H. Thomas, B. George, and D. Jack, Dynamic Task Discovery in PaRSEC: A Data-flow Task-based Runtime, Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA '17, vol.6, pp.1-6, 2017.

H. Yuankai, X. Zhoubing, A. Katherine, P. Prasanna, B. Shunxing et al., Spatially Localized Atlas Network Tiles Enables 3D Whole Brain Segmentation from Limited Data, Medical Image Computing and Computer Assisted Intervention - MICCAI, pp.698-705, 2018.

H. Yuankai, X. Zhoubing, X. Yunxi, A. Katherine, P. Prasanna et al., 3D Whole Brain Segmentation using Spatially Localized Atlas Network Tiles, vol.194, pp.105-119, 2019.

H. Zaeem, Z. Taieb, and M. Rami, Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilities, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18, 2018.

H. Thomas and R. Yves, editors, Fault-Tolerance Techniques for High-Performance Computing, 2015.

I. Yuichi, P. Tapasya, I. Koji, A. Mutsumi, R. Barry et al., Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing, SC'15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-12, 2015.

I. Michael, B. Mihai, Y. Yuan, B. Andrew, and F. Dennis, Dryad: Distributed Data-parallel Programs from Sequential Building Blocks, 2nd ACM SIGOPS/EuroSys European Conf. Computer Systems, 2007.

I. Leila and K. Latifur, Implementation and performance evaluation of a scheduling algorithm for divisible load parallel applications in a cloud computing environment. Software: Practice and Experience, vol.45, 2014.

K. Hartmut, H. Thomas, A. Bryce, A. Serio, and F. Dietmar, HPX: A Task Based Programming Model in a Global Address Space, Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014.

K. Jon, R. Yuval, and T. Éva, Allocating Bandwidth for Bursty Connections, STOC, pp.664-673, 1997.

K. Brian, Scalability in the Presence of Variability, 2018.

K. Rajath and V. Sathish, Identifying Quick Starters: Towards an Integrated Framework for Efficient Predictions of Queue Waiting Times of Batch Parallel Jobs, in Walfredo Cirne, Narayan Desai, Eitan Frachtenberg, and Uwe Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, pp.196-215, 2013.

J. Pamela, T. L. Lamontagne, J. C. Benzinger, . Morris, K. Sarah et al., OASIS-3: Longitudinal Neuroimaging, Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease. medRxiv, 2019.

L. Bennett, Medical-image Analysis and Statistical Interpretation (MASI) Lab

L. Samuel, C. Philip, L. Robert, R. Robert, K. Harms et al., I/O Performance Challenges at Leadership Scale, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, 2009.

L. Matthew, A. James, A. Utkarsh, B. Eric, C. Hank et al., The ALPINE In Situ Infrastructure: Ascending from the Ashes of Strawman. In Proceedings of the In Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization, pp.42-46, 2017.

L. Matthew, H. Cyrus, K. James, P. David, J. S. Meredith et al., Performance Modeling of In Situ Rendering, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), vol.24, pp.1-24, 2016.

K. Li, X. Tang, B. Veeravalli, and K. Li, Scheduling Precedence Constrained Stochastic Tasks on Heterogeneous Cluster Systems, IEEE Transactions on Computers, vol.64, issue.1, pp.191-204, 2015.

L. Min, V. Sudharshan, B. Ali, M. Fei et al., Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-12, 2010.

L. Shigang, B. Tal, D. Salvatore, A. Dan, H. Torsten et al., Taming unbalanced training workloads in deep learning with partial collective operations, Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.45-61, 2020.

L. David, The ANL/IBM SP Scheduling System, JSSPP, pp.295-303, 1995.

L. Ning, J. Cope, C. Philip, C. Christopher, R. Robert et al., On the role of burst buffers in leadership-class storage systems, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012.

L. Jay, Z. Fang, L. Qing, K. Scott, R. Oldfield et al., Managing variability in the IO performance of petascale storage systems, SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-12, 2010.

J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, and C. Jin, Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS), Proceedings of the 6th international workshop on Challenges of large applications in distributed environments - CLADE '08, pp.15-24, 2008.

L. Bertram, A. Ilkay, B. Chad, H. Dan, J. Efrat et al., Scientific Workflow Management and the Kepler System: Research Articles. Concurr. Comput. : Pract. Exper, vol.18, issue.10, pp.1039-1065, 2006.

M. Kwan-liu, W. Chaoli, Y. Hongfeng, and T. Anna, In-situ processing and visualization for ultrascale simulations, Journal of Physics: Conference Series, vol.78, issue.1, p.12043, 2007.

M. Xiaosong, J. Lee, and M. Winslett, High-Level Buffering for Hiding Periodic Output Cost in Scientific Simulations, IEEE Transactions on Parallel and Distributed Systems, vol.17, issue.3, pp.193-204, 2006.

M. Preeti, V. Venkatram, K. Christopher, M. Todd, and P. Michael, Optimal Execution of Co-analysis for Large-scale Molecular Dynamics Simulations, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, vol.60, pp.1-60, 2016.

M. Preeti, V. Venkatram, M. Todd, K. Christopher, H. Mark et al., Optimal Scheduling of In-situ Analysis for Large-scale Scientific Simulations, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, vol.52, pp.1-52, 2015.

A. Matsunaga and J. A. Fortes, On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.495-504, 2010.

A. Mechelli, C. J. Price, K. J. Friston, and J. Ashburner, Voxel-Based Morphometry of the Human Brain: Methods and Applications, Current Medical Imaging Reviews, vol.1, issue.2, pp.105-113, 2005.

M. André, S. Mark, T. Matteo, and J. Shantenu, RADICAL-Pilot: Scalable Execution of Heterogeneous and Dynamic Workloads on Supercomputers, 2015.

M. Andrey, K. Alexey, and K. Kir, Containers checkpointing and live migration, Ottawa Linux Symposium, 2008.

R. H. Möhring, A. S. Schulz, and M. Uetz, Approximation in stochastic scheduling, Journal of the ACM, vol.46, issue.6, pp.924-942, 1999.

M. Clement, D. Matthieu, R. Bruno, and P. Tom, Automatic Data Filtering for In Situ Workflows, IEEE Cluster, 2017.

A. W. Mu'alem and D. G. Feitelson, Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.6, pp.529-543, 2001.

J. Niño-Mora, Stochastic Scheduling, Encyclopedia of Optimization, pp.3818-3824, 2009.

P. Jun Woo, T. Alexey, J. Angela, K. Michael, and G. Gregory, 3Sigma: Distribution-Based Cluster Scheduling for Runtime Uncertainty, Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, 2018.

P. Tapasya, K. David, . Lowenthal, R. Barry, M. Schulz et al., Exploring hardware overprovisioning in power-constrained, high performance computing, Proceedings of the 27th international ACM conference on International conference on supercomputing, pp.173-182, 2013.

P. Tapasya, J. J. Thiagarajan, A. Ayala, and I. Tanzima, Performance Optimality or Reproducibility: That is the Question, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, 2019.

P. Pébaÿ, J. C. Bennett, D. Hollman, S. Treichler, P. S. Mccormick et al., Towards Asynchronous Many-Task in Situ Data Analysis Using Legion, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.1033-1037, 2016.

P. Simon, E. Niklas, L. Stefan, R. Lukas, M. Antonello et al., Migrating LinuX Containers Using CRIU, Michela TAUFER, pp.674-684, 2016.

P. Michael, Scheduling: Theory, Algorithms, and Systems, 2008.

P. Xing, L. Ling, M. Yiduo, S. Sankaran, K. Younggyun et al., Understanding performance interference of I/O workload in virtualized cloud environments, 2010 IEEE 3rd International Conference on Cloud Computing, pp.51-58, 2010.

R. Gonzalo, Ö. Per-olov, E. Erik, A. Katie, G. Richard et al., HPC System Lifetime Story: Workload Characterization and Evolutionary Analyses on NERSC Systems, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, pp.57-60, 2015.

R. Manuel, J. A. Moríñigo, and R. Mayo-garcía, When you have a hammer, everything looks like a nail -Checkpoint/restart in Slurm, 2017.

L. M. Silva and J. G. Silva, System-level versus user-defined checkpointing, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No. 98CB36281), pp.68-74, 1998.

S. Ajeet, B. Pavan, and F. Wu-chun, GePSeA: A General-Purpose Software Acceleration Framework for Lightweight Task Offloading, International Conference on Parallel Processing, pp.261-268, 2009.

S. David and K. William, Understanding the causes of performance variability in hpc workloads, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, pp.137-149, 2005.

S. Joseph, C. Waiman, Z. Honbo, and L. David, The EASY - LoadLeveler API Project, JSSPP, pp.41-47, 1996.

S. Aishwarya and H. Muzammil, Pricing schemes in cloud computing: A review, International Journal of Advanced Computer Research, vol.7, 2017.

G. Staples, TORQUE resource manager, Proceedings of the 2006 ACM/IEEE conference on Supercomputing - SC '06, p.8, 2006.

S. Pradeep, D. Philip, D. Shaohua, K. Scott, K. Hemanth et al., Stacker: an autonomic data movement engine for extreme-scale data staging-based in situ workflows, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'18), p.73, 2018.

S. Qian, J. Tong, R. Melissa, B. Hoang, Z. Fan et al., Adaptive Data Placement for Staging-based Coupled Scientific Workflows, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, vol.65, pp.1-65, 2015.

T. Mohammed, D. Brandon, A. Daniel, H. William, Y. Huichen et al., Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning, Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), PEARC '19, 2019.

K. Tang, P. Huang, X. He, T. Lu, S. S. Vazhkudai et al., Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp.87-98, 2017.

W. Tang, Z. Lan, N. Desai, D. Buettner, and Y. Yu, Reducing Fragmentation on Torus-Connected Supercomputers, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.828-839, 2011.

T. Xiaoyong, L. Kenli, L. Guiping, F. Kui, and W. Fan, A Stochastic Scheduling Algorithm for Precedence Constrained Tasks on Grid, Future Gener. Comput. Syst, vol.27, issue.8, pp.1083-1091, 2011.

T. Dingwen, D. Sheng, C. Zizhong, and C. Franck, Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization, 2017 IEEE International Parallel and Distributed Processing Symposium, pp.1129-1139, 2017.

T. Théophile, R. Alejandro, F. Yvan, I. Bertrand, and R. Bruno, Melissa: Large Scale In Transit Sensitivity Analysis Avoiding Intermediate Files, The International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing), pp.1-14, 2017.

H. Topcuoglu, S. Hariri, and M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.260-274, 2002.

T. Tu, C. A. Rendleman, D. W. Borhani, R. O. Dror, J. Gullingsrud et al., A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, vol.56, p.12, 2008.

, Environmental assessment for operations, upgrades, and modifications in SNL/NM Technical Area IV, SNL-NM), 1996.

V. Vinod Kumar, M. Arun, D. Chris, A. Sharad et al., Apache Hadoop YARN: Yet Another Resource Negotiator, Proceedings of the 4th Annual Symposium on Cloud Computing, vol.5, p.16, 2013.

V. Vishwanath, M. Hereld, and M. E. Papka, Toward simulation-time data analysis and I/O acceleration on leadership-class systems, 2011 IEEE Symposium on Large Data Analysis and Visualization, pp.9-14, 2011.

J. von Neumann, First Draft of a Report on the EDVAC, IEEE Ann. Hist. Comput, vol.15, issue.4, pp.27-75, 1993.

W. Yi, A. Gagan, B. Tekin, and J. Wei, Smart: A MapReduce-like Framework for In-situ Scientific Analytics, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, vol.51, pp.1-51, 2015.

G. Weiss, Turnpike Optimality of Smith's Rule in Parallel Machines Stochastic Scheduling, Mathematics of Operations Research, vol.17, issue.2, pp.255-270, 1992.

B. Whitlock, J. M. Favre, and J. S. Meredith, Eurographics Symposium on Parallel Graphics and Visualization (EGPGV'07), Computers & Graphics, vol.31, issue.2, p.308, 2007.

W. Samuel, W. Andrew, and P. David, Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM, vol.52, issue.4, pp.65-76, 2009.

K. Wolter, Checkpointing Systems, Stochastic Models for Fault Tolerance, pp.171-176, 2010.

H. Xu and B. Li, Dynamic Cloud Pricing for Revenue Maximization, IEEE Transactions on Cloud Computing, vol.1, issue.2, pp.158-171, 2013.

Y. Keiji and A. L. , The K computer Operations: Experiences and Statistics. Procedia Computer Science, vol.29, pp.576-585, 2014.

O. Yildiz, M. Dorier, S. Ibrahim, R. Ross, and G. Antoniu, On the Root Causes of Cross-Application I/O Interference in HPC Storage Systems, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.750-759, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01270630

Y. Orcun, Z. Amelie Chi, and I. Shadi, Eley: On the Effectiveness of Burst Buffers for Big Data Processing in HPC systems, Cluster'17 - 2017 IEEE International Conference on Cluster Computing, 2017.

A. B. Yoo, M. A. Jette, and M. Grondona, SLURM: Simple Linux Utility for Resource Management, Job Scheduling Strategies for Parallel Processing, pp.44-60, 2003.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.

H. Yu, C. Wang, R. W. Grout, J. H. Chen, and K.-L. Ma, In Situ Visualization for Large-Scale Combustion Simulations, IEEE Computer Graphics and Applications, vol.30, issue.3, pp.45-57, 2010.

Z. Fan, C. Docan, M. Parashar, S. Klasky, P. Norbert et al., Enabling In-situ Execution of Coupled Scientific Workflow on Multi-core Platform, Parallel Distributed Processing Symposium (IPDPS), pp.1352-1363, 2012.

Z. Zhaoning, Y. Lujia, P. Yuxing, and L. I. Dongsheng, A quick survey on large scale distributed deep learning systems, IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS), pp.1052-1056, 2018.

F. Zheng, H. Yu, C. Hantas, M. Wolf, G. Eisenhauer et al., GoldRush, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13, pp.1-12, 2013.

F. Zheng, H. Zou, G. Eisenhauer, K. Schwan, M. Wolf et al., FlexIO: I/O Middleware for Location-Flexible Scientific Data Analytics, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp.320-331, 2013.

Z. Fang, H. Abbasi, C. Docan, J. Lofstead, L. Qing et al., PreDatA -Preparatory Data Analytics on Peta-Scale Machines. In Parallel Distributed Processing (IPDPS), pp.1-12, 2010.

Z. Sergey, B. Sergey, and F. Alexandra, Addressing Shared Resource Contention in Multicore Processors via Scheduling, Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pp.129-142, 2010.

Z. Salah, C. Raphael-de, T. Denis, and L. Arnaud, Improving the performance of batch schedulers using online job size classification

A. Guillaume, G. Ana, H. Valentin, R. Padma, R. Yves et al., Reservation Strategies for Stochastic Jobs, IPDPS 2019 - 33rd IEEE International Parallel and Distributed Processing Symposium, pp.166-175, 2019.

G. Ana, G. Brice, H. Valentin, P. Guillaume, R. Padma et al., Reservation and Checkpointing Strategies for Stochastic Jobs, IPDPS 2020 - 34th IEEE International Parallel and Distributed Processing Symposium, 2020.

H. Valentin, Techniques d'ordonnancement pour les applications stochastiques sur plateformes HPC

A. Guillaume, G. Brice, H. Valentin, and R. Bruno, Modeling High-throughput Applications for in situ Analytics, International Journal of High Performance Computing Applications, vol.33, issue.6, pp.1185-1200, 2019.

G. Ana, G. Brice, H. Valentin, and P. Guillaume, Profiles of upcoming HPC Applications and their Impact on Reservation Strategies. IEEE Transactions on Parallel and Distributed Systems

H. Valentin, Modeling HPC applications for in situ Analytics, IPDPS 2019 - 33rd IEEE International Parallel and Distributed Processing Symposium, May 2019. Poster.

A. Guillaume, G. Ana, H. Valentin, R. Padma, R. Yves et al., Reservation Strategies for Stochastic Jobs (Extended Version), 2018.

G. Ana, G. Brice, H. Valentin, P. Guillaume, R. Padma et al., Reservation and Checkpointing Strategies for Stochastic Jobs (Extended Version), 2019.