Context

Science and industry rely on High-Performance Computing (Calcul Haute Performance, CHP) to solve problems that demand immense computing resources. Its users include aeronautics, the automotive and aerospace industries, medicine, the military, and many others.

The compute requirements of applications keep growing, and the computing power of CHP systems increases exponentially (cf. Figure 1.2). However, the growth in computing power of general-purpose platforms goes hand in hand with their complexity of use. In particular, the memory hierarchy of these systems is extremely wide and deep (cf. Chapter 2), and has so far been abstracted from the user as a flat address space within a node (cf. Chapter 3). Yet we show that the performance of memory accesses can vary by several orders of magnitude depending on how the memory is used (cf. Chapter 2). Moreover, the gap between memory access performance and raw compute performance keeps widening [105] at a fast pace. Consequently, on general-purpose platforms, the execution speed of applications will depend on how well the memory is used.

Emerging technologies (In-Package Memory (IPM), network-attached memory, non-volatile RAM (Non-Volatile Dual Inline Memory Module, NVDIMM)) currently require exposing part of this complexity to the user in order to obtain reasonable performance. Yet the optimization problems raised by applications on such complex machines clearly cannot be solved by users who already design applications within a complicated science of their own. While systems evolve, applications keep becoming more numerous and more complex. They require [...]

B.2. Numerical Aerodynamic Simulation (NAS) benchmarks

- hpccg [52]: an acronym for High Performance Conjugate Gradient, a conjugate-gradient solver for preconditioned matrices, presented as a tool to check that CHP platforms operate correctly. The application's characteristics are measured around the call to HPCCG(A, b, x, max_iter, tolerance, niters, normr, times) in the file main.cpp (a minimal sketch of this measurement pattern is shown after this list). The application is run with the parameters 200 200 200.

[...], limiting the runs in the code to a single working-set size per thread of nmax = 256*1024*16, in the file qla_bench.c. In that same file, the measurements are taken around the #include "benchfuncs.c" inclusion of the application's main code.
- HACCmk: a cosmology code built around a loop split into a data-preparation phase followed by a compute phase (see the second sketch after this list).
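To make the measurement methodology concrete, here is a minimal, self-contained C++ sketch of the pattern used for hpccg: only the region around the solver call is measured. Everything in it is illustrative and not the benchmark's actual code; hpccg_solve() is a stand-in for the real HPCCG(A, b, x, max_iter, tolerance, niters, normr, times) call in main.cpp, and std::chrono timing stands in for whichever counter-based measurement is actually taken.

    // Illustrative only: a placeholder solver wrapped by a measurement window,
    // mimicking the way characteristics are measured around the HPCCG call.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Stand-in for the conjugate-gradient solve of the hpccg benchmark.
    static void hpccg_solve(const std::vector<double>& b, std::vector<double>& x,
                            int max_iter, double tolerance, int& niters,
                            double& normr) {
      (void)tolerance;
      niters = max_iter;              // pretend we ran all iterations
      normr  = 0.0;
      for (std::size_t i = 0; i < x.size(); ++i) x[i] = b[i];
    }

    int main() {
      const int n = 200 * 200 * 200;  // problem sized like the "200 200 200" run
      std::vector<double> b(n, 1.0), x(n, 0.0);
      int niters = 0;
      double normr = 0.0;

      // Measurement window: only the solver call is covered, not the setup.
      auto t0 = std::chrono::steady_clock::now();
      hpccg_solve(b, x, /*max_iter=*/150, /*tolerance=*/0.0, niters, normr);
      auto t1 = std::chrono::steady_clock::now();

      std::printf("measured solver region: %.3f s (%d iterations)\n",
                  std::chrono::duration<double>(t1 - t0).count(), niters);
      return 0;
    }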
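A second self-contained sketch illustrates the HACCmk-like structure: each loop iteration has a data-preparation phase excluded from the measurement, followed by a compute phase that is the only part measured. The arithmetic, the working-set size and the timing are placeholders, not the real HACCmk force kernel or the actual measurement infrastructure.

    // Illustrative only: two-phase loop where the measurement covers just the
    // compute phase, as described for HACCmk above.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      const int nsteps = 10;
      const std::size_t n = 1u << 20;       // placeholder working-set size
      std::vector<float> in(n), out(n, 0.0f);
      double compute_seconds = 0.0;

      for (int step = 0; step < nsteps; ++step) {
        // Phase 1: data preparation, deliberately outside the measured region.
        for (std::size_t i = 0; i < n; ++i)
          in[i] = static_cast<float>((i + step) % 97);

        // Phase 2: compute phase, the only region that is measured.
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
          out[i] += in[i] * in[i] + 0.5f;
        auto t1 = std::chrono::steady_clock::now();
        compute_seconds += std::chrono::duration<double>(t1 - t0).count();
      }

      std::printf("compute phase: %.3f s over %d steps\n", compute_seconds, nsteps);
      return 0;
    }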


The class of a NAS application is chosen by the user at compile time and determines the size of the problem being solved.

Bibliography

Intel 3D XPoint non-volatile memory.

K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands et al., The landscape of parallel computing research : A view from berkeley, 2006.

InfiniBand Trade Association, The InfiniBand Architecture Specification, 2000.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, Starpu : a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation : Practice and Experience, vol.23, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter et al., The nas parallel benchmarks, 1991.

I. Barandiaran, The random subspace method for constructing decision forests, vol.20, 1998.

N. Beckmann and D. Sanchez, Modeling cache performance beyond lru, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp.225-236, 2016.

C. Bienia, S. Kumar, J. Singh, and K. Li, The parsec benchmark suite : Characterization and architectural implications, Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pp.72-81, 2008.

M. S. Birrittella, M. Debbage, R. Huggahalli, J. Kunz, T. Lovett et al., Intel Omni-Path Architecture: Enabling scalable, high performance fabrics, High-Performance Interconnects (HOTI), pp.1-9, 2015.

S. Blagodurov, S. Zhuravlev, A. Fedorova, and A. Kamali, A case for numa-aware contention management on multicore systems, Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pp.557-558, 2010.

B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp.144-152, 1992.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault et al., Parsec : Exploiting heterogeneity to enhance scalability. Computing in Science & Engineering, vol.15, pp.36-45, 2013.

F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications, in IEEE, editor: PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00429889

F. Broquedis, N. Furmento, B. Goglin, P. Wacrenier, and R. Namyst, ForestGOMP : an efficient OpenMP environment for NUMA architectures, International Journal of Parallel Programming, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00496295

P. Caheny, L. Alvarez, S. Derradji, M. Valero, M. Moretó et al., Reducing cache coherence traffic with a numa-aware runtime approach, IEEE Transactions on Parallel and Distributed Systems, vol.29, issue.5, pp.1174-1187, 2018.

M. Castro, L. F. Góes, C. P. Ribeiro, M. Cole, M. Cintra, and J.-F. Méhaut, A machine learning-based approach for thread mapping on transactional memory applications, 18th International Conference on High Performance Computing, pp.1-10, 2011.

G. Chatzopoulos, R. Guerraoui, T. Harris, and V. Trigonakis, Abstracting multi-core topologies with mctop, Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pp.544-559, 2017.

P. Cicotti and L. Carrington, Adamant : Tools to capture, analyze, and manage data movement, International Conference on Computational Science, vol.80, pp.6-8, 2016.

P. Corbett, D. Feitelson, S. Fineberg, Y. Hsu, B. Nitzberg et al., Overview of the MPI-IO parallel I/O interface, Input/Output in Parallel and Distributed Computer Systems, pp.127-146, 1996.

E. H. M. Cruz, M. Diener, M. A. Z. Alves et al., Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols, Journal of Parallel and Distributed Computing, vol.74, issue.3, pp.2215-2228, 2014.

E. H. M. Cruz, M. Diener, L. L. Pilla et al., An efficient algorithm for communication-based task mapping, Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on, pp.207-214, 2015.

L. Dagum and R. Menon, Openmp : an industry standard api for shared-memory programming, IEEE computational science and engineering, vol.5, issue.1, pp.46-55, 1998.

2016 IEEE International Symposium on High Performance Computer Architecture (HPCA 2016), 2016.

A. C. de Melo, Slides from Linux Kongress, vol.18, 2010.


N. Denoyelle, Moniteurs hiérarchiques de performance, pour gérer l'utilisation des ressources partagées de la topologie, Compas, 2016.

N. Denoyelle, B. Goglin, A. Ilic, E. Jeannot, and L. Sousa, Modeling large compute nodes with heterogeneous memories with cache-aware roofline model, in High Performance Computing Systems - Performance Modeling, Benchmarking, and Simulation - 8th International Workshop, vol.10724, pp.91-113, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01622582

N. Denoyelle, B. Goglin, and E. Jeannot, A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems, in Proceedings of Euro-Par 2015: Parallel Processing Workshops, Springer, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01183083

N. Denoyelle, B. Goglin, and E. Jeannot, Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model, IEEE Trans. Parallel Distrib. Syst., vol.19, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01924951

N. Denoyelle, A. Ilic, B. Goglin, L. Sousa, and E. Jeannot, Automatic Cache Aware Roofline Model Building and Validation Using Topology Detection, NESUS Third Action Workshop and Sixth Management Committee Meeting, vol.I, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01381982

S. Derradji, T. Palfer-Sollier, J. Panziera, A. Poudes, and F. Wellenreiter, The BXI interconnect architecture, High-Performance Interconnects (HOTI), 2015 IEEE 23rd Annual Symposium on, pp.18-25, 2015.

T. Dey, W. Wang, J. W. Davidson, and M. L. Soffa, Characterizing multi-threaded applications based on shared-resource contention, Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS '11, pp.76-86, 2011.

M. Diener, E. H. M. Cruz, and P. O. A. Navaux, Locality vs. balance: Exploring data mapping policies on NUMA systems, 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp.9-16, 2015.

M. Diener, E. H. M. Cruz, P. O. A. Navaux, A. Busse, and H.-U. Heiß, Communication-aware process and thread mapping using online communication detection, Parallel Computing, vol.43, issue.C, pp.43-63, 2015.

M. Diener, E. H. M. Cruz, L. L. Pilla, F. Dupros, and P. O. A. Navaux, Characterizing communication and page usage of parallel applications for thread and data mapping, Performance Evaluation, pp.18-36, 2015.

C. Ding and Y. Zhong, Predicting whole-program locality through reuse distance analysis, Acm Sigplan Notices, vol.38, pp.245-257, 2003.

J. J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK benchmark: past, present and future, Concurrency and Computation: Practice and Experience, vol.15, pp.803-820, 2003.

J. J. Dongarra, H. W. Meuer, and E. Strohmaier, Top500 supercomputer sites, 1994.

U. Drepper, What every programmer should know about memory, 2007.

D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, Theory and practice in parallel job scheduling, Workshop on Job Scheduling Strategies for Parallel Processing, pp.1-34, 1997.

C. Fricker, P. Robert, and J. Roberts, A versatile and accurate approximation for lru cache performance, Proceedings of the 24th International Teletraffic Congress, page 8. International Teletraffic Congress, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00671937

S. Ramos Garea and T. Hoefler, Modelling communications in cache coherent systems, 2013.

F. Gaud, B. Lepers, J. R. Funston, M. Dashti, A. Fedorova et al., Challenges of memory management on modern NUMA systems, Commun. ACM, vol.58, issue.12, pp.59-66, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01242202

J. Gaur, M. Chaudhuri, P. Ramachandran, and S. Subramoney, Near-optimal access partitioning for memory hierarchies with multiple heterogeneous bandwidth sources, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp.13-24, 2017.


A. Géron, Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools, and techniques to build intelligent systems, 2017.

A. Gimenez, T. Gamblin, B. Rountree, A. Bhatele, I. Jusufi et al., Dissecting on-node memory access performance : A semantic approach, High Performance Computing, Networking, Storage and Analysis, SC14 : International Conference for, pp.166-176, 2014.

B. Goglin, Towards the structural modeling of the topology of next-generation heterogeneous cluster nodes with hwloc, Inria, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01400264

W. D. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol.1, 1999.

D. Hackenberg, D. Molka, and W. E. Nagel, Comparing cache architectures and coherency protocols on x86-64 multicore smp systems, Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pp.413-422, 2009.

T. Hansson, C. Oostenbrink, and W. F. van Gunsteren, Molecular dynamics simulations, vol.12, pp.190-196, 2002.

J. L. Henning, Spec cpu2006 benchmark descriptions, vol.34, pp.1-17, 2006.

M. Heroux, HPCCG microapp, 2007.

M. D. Hill and M. R. Marty, Amdahl's law in the multicore era, Computer, vol.41, issue.7, pp.33-38, 2008.

T. K. Ho, Random decision forests, Proceedings of the Third International Conference on Document Analysis and Recognition, vol.1, pp.278-282, 1995.

R. D. Hornung, Livermore compiler analysis loop suite, Technical report, 2013.

Ranking, 2018.


A. Ilic, F. Pratas, and L. Sousa, Cache-aware Roofline model : Upgrading the loft, IEEE Computer Architecture Letters, vol.13, issue.1, pp.21-24, 2014.

Hardware performance events of Intel processors, Chapter, vol.9, 2018.

Intel Roofline model, included in the Intel Advisor software, 2018.

S. Jarp, R. Jurga, and A. Nowak, Perfmon2 : a leap forward in performance monitoring, Journal of Physics : Conference Series, vol.119, p.42017, 2008.

E. Jeannot, E. Meneses, G. Mercier, F. Tessier, and G. Zheng, Communication and topology-aware load balancing in Charm++ with TreeMatch, Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pp.1-8, 2013.

E. Jeannot and G. Mercier, Near-Optimal Placement of MPI processes on Hierarchical NUMA Architectures, Lecture Notes on Computer Science, vol.6272, pp.199-210, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00544346

I. Karlin, J. Keasler, and R. Neely, LULESH 2.0 updates and changes, Technical report, 2013.

H. C. Kirsch, Method and circuit for reducing DRAM refresh power by reducing access transistor sub-threshold leakage, US Patent, vol.6, p.769, 2005.

L. B. Kish, End of Moore's law: thermal (noise) death of integration in micro and nano electronics, Physics Letters A, vol.305, issue.3-4, pp.144-149, 2002.

A. Kleen, A NUMA API for Linux, 2005.

A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis, Power aware page allocation, ACM SIGPLAN Notices, vol.35, issue.11, pp.105-116, 2000.


D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min et al., LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies, IEEE Transactions on Computers, vol.50, issue.12, pp.1352-1361, 2001.

C. E. Leiserson, Fat-trees: universal networks for hardware-efficient supercomputing, IEEE Transactions on Computers, vol.C-34, issue.10, pp.892-901, 1985.

B. Lepers, V. Quéma, and A. Fedorova, Thread and memory placement on NUMA systems: Asymmetry matters, USENIX Annual Technical Conference, pp.277-289, 2015.

M. Lesieur, Turbulence in fluids : stochastic and numerical modelling. Nijhoff, 1987.

Y. J. Lo, S. Williams, B. Van Straalen, T. J. Ligocki, M. J. Cordery et al., Roofline model toolkit: A practical tool for architectural and program analysis, in High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, pp.129-148, 2015.

O. G. Lorenzo, T. F. Pena, J. C. Cabaleiro et al., Using an extended Roofline Model to understand data and thread affinities on NUMA systems, Annals of Multicore and GPU Programming, vol.1, issue.1, pp.56-67, 2014.

C. A. Mack, Fifty years of moore's law, IEEE Transactions on Semiconductor Manufacturing, vol.24, issue.2, pp.202-207, 2011.

Z. Majo and T. R. Gross, Memory management in NUMA multicore systems: Trapped between cache contention and interconnect overhead, SIGPLAN Not., vol.46, issue.11, pp.11-20, 2011.

MAQAO website describing the application profiling tool based on hardware counters, 2018.

Tool for measuring memory access performance.

J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp.19-25, 1995.

Description documents for the MILCmk application, 2018.

S. Mittal and J. S. Vetter, A survey of software techniques for using non-volatile memories for storage and main memory systems, IEEE Transactions on Parallel and Distributed Systems, issue.99, pp.1-1, 2015.

P. J. Mucci, S. Browne, C. Deane, and G. Ho, PAPI: A portable interface to hardware performance counters, Proceedings of the Department of Defense HPCMP Users Group Conference, vol.710, 1999.

R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, Introducing the Graph 500, Cray User's Group (CUG), vol.19, pp.45-74, 2010.

C. Natarajan, B. Christenson, and F. Briggs, A study of performance impact of memory controller features in multi-processor server environment, Proceedings of the 3rd Workshop on Memory Performance Issues : In Conjunction with the 31st International Symposium on Computer Architecture, WMPI '04, pp.80-87, 2004.

A. Ng, Coursera machine learning MOOC, 2018.

J. Nieplocha, R. J. Harrison, and R. J. Littlefield, Global Arrays: A portable "shared-memory" programming model for distributed memory computers, Supercomputing '94, Proceedings, pp.340-349, 1994.

F. Pellegrini, Scotch and PT-Scotch Graph Partitioning Software: An Overview, in U. Naumann and O. Schenk, editors: Combinatorial Scientific Computing, pp.373-406, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00770422

S. Perarnau, J. A. Zounmevo, B. Gerofi, K. Iskra, and P. Beckman, Exploring data migration for future deep-memory many-core systems, 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp.289-297, 2016.


A. Pesterev, N. Zeldovich, and R. T. Morris, Locating cache performance bottlenecks using data profiling, Proceedings of the 5th European Conference on Computer Systems, EuroSys '10, pp.335-348, 2010.

Use of simulation and parallel machines for pharmaceutical applications, 2018.

Web page of the federative platform for research in computer science and mathematics (PlaFRIM), 2018.

Disclosure of the specific registers used to disable the data prefetchers, 2018.

B. Putigny, B. Goglin, and D. Barthou, A Benchmark-based Performance Model for Memory-bound HPC Applications, International Conference on High Performance Computing & Simulation (HPCS 2014), 2014.
URL : https://hal.archives-ouvertes.fr/hal-00985598

M. K. Qureshi and Y. N. Patt, Utility-based cache partitioning : A low-overhead, high-performance, runtime mechanism to partition shared caches, 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06), pp.423-432, 2006.

S. Raasch and M. Schröter, Palm-a large-eddy simulation model performing on massively parallel computers, Meteorologische Zeitschrift, vol.10, issue.5, pp.363-372, 2001.

S. Ramos and T. Hoefler, Capability models for manycore memory systems : A case-study with xeon phi knl, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.297-306, 2017.

S. Rixner, W. J. Dally, J. Kapasi, P. Mattson, and J. D. Owens, Memory access scheduling. SIGARCH Comput. Archit. News, vol.28, issue.2, pp.128-138, 2000.

E. Rohou, Tiptop: Hardware Performance Counters for the Masses, 41st International Conference on Parallel Processing Workshops (ICPPW), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00639173


E. Rohou and D. Guyon, Sequential performance: Raising awareness of the gory details, Procedia Computer Science, vol.51, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01162336

R. Schreiber and J. J. Dongarra, Automatic blocking of nested loops, 1990.

D. L. Schuff, B. S. Parsons, and V. S. Pai, Multicore-aware reuse distance analysis, Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pp.1-8, 2010.

P. Schwan, Lustre : Building a file system for 1000-node clusters, Proceedings of the 2003 Linux symposium, pp.380-386, 2003.

H. Servat, A. J. Peña, G. Llort, E. Mercadal, H. C. Hoppe et al., Automating the application data placement in hybrid memory systems, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.126-136, 2017.

J. Shalf, S. Dosanjh, and J. Morrison, Exascale computing technology challenges, Proceedings of the 9th International Conference on High Performance Computing for Computational Science, VECPAR'10, pp.1-25, 2011.

H. Shan, K. Antypas, and J. Shalf, Characterizing and predicting the i/o performance of hpc applications using a parameterized synthetic benchmark, Proceedings of the 2008 ACM/IEEE conference on Supercomputing, p.42, 2008.

J. Shore and R. Johnson, Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy, IEEE Transactions on information theory, vol.26, issue.1, pp.26-37, 1980.

Answer to the question: Why must the input samples of a machine learning algorithm be shuffled?, 2018.

S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers, High Performance Computer Architecture, IEEE 13th International Symposium on, pp.63-74, 2007.
DOI : 10.1109/hpca.2007.346185

URL : http://www.ece.utexas.edu/projects/hps/pub/srinath_hpca07.pdf


T. M. Taha and D. S. Wills, An instruction throughput model of superscalar processors, IEEE Transactions on Computers, vol.57, issue.3, pp.389-403, 2008.

F. Tessier, P. Malakar, V. Vishwanath, E. Jeannot, and F. Isaila, Topology-aware data aggregation for intensive i/o on large-scale supercomputers, 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), pp.73-81, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01394741

V. Tiwari, S. Malik, and A. Wolfe, Power analysis of embedded software : a first step towards software power minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.2, pp.437-445, 1994.

Top500 ranking of June 2018, 2018.

G. Tournavitis, Z. Wang, B. Franke, and M. F. P. O'Boyle, Towards a holistic approach to auto-parallelization: Integrating profile-driven parallelism detection and machine-learning based mapping, SIGPLAN Not., vol.44, issue.6, pp.177-187, 2009.

M. Unno, S. Aono, and H. Asai, Gpu-based massively parallel 3-d hie-fdtd method for high-speed electromagnetic field simulation, IEEE Transactions on Electromagnetic Compatibility, vol.54, issue.4, pp.912-921, 2012.

O. Villa, D. R. Johnson, M. O'Connor, E. Bolotin, D. Nellans et al., Scaling the power wall: a path to exascale, High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for, pp.830-841, 2014.

J. von Neumann, First draft of a report on the EDVAC, IEEE Annals of the History of Computing, issue.4, pp.27-75, 1993.

Z. Wang and M. F. P. O'Boyle, Mapping parallelism to multi-cores: a machine learning based approach, ACM SIGPLAN Notices, vol.44, pp.75-84, 2009.

D. West, Introduction to graph theory, vol.2, 2001.

S. Williams, A. Waterman, and D. Patterson, Roofline : An Insightful Visual Performance Model for Multicore Architectures, Commun. ACM, vol.52, issue.4, pp.65-76, 2009.

J. R. Wilson and K. A. Lorenz, Short History of the Logistic Regression Model, pp.17-23, 2015.

Wm. A. Wulf and S. A. McKee, Hitting the memory wall: Implications of the obvious, SIGARCH Comput. Archit. News, vol.23, issue.1, pp.20-24, 1995.

S. Zhuravlev, S. Blagodurov, and A. Fedorova, Addressing shared resource contention in multicore processors via scheduling, SIGPLAN Not, vol.45, issue.3, pp.129-142, 2010.

Publications

N. Denoyelle, Moniteurs hiérarchiques de performance, pour gérer l'utilisation des ressources partagées de la topologie, Compas, 2016.

N. Denoyelle, B. Goglin, A. Ilic, E. Jeannot, and L. Sousa, Modeling large compute nodes with heterogeneous memories with cache-aware roofline model, in High Performance Computing Systems - Performance Modeling, Benchmarking, and Simulation - 8th International Workshop, vol.10724, pp.91-113, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01622582

N. Denoyelle, B. Goglin, and E. Jeannot, A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems, in Proceedings of Euro-Par 2015: Parallel Processing Workshops, Springer, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01183083

N. Denoyelle, B. Goglin, and E. Jeannot, Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model, IEEE Trans. Parallel Distrib. Syst., vol.19, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01924951

N. Denoyelle, A. Ilic, B. Goglin, L. Sousa, and E. Jeannot, Automatic Cache Aware Roofline Model Building and Validation Using Topology Detection, NESUS Third Action Workshop and Sixth Management Committee Meeting, vol.I, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01381982
