P. Dubey, Recognition, mining and synthesis moves computers to the era of tera, Technology@Intel Magazine, vol.9, issue.2, pp.1-10, 2005.

X. Zhao, K. Rodrigues, Y. Luo, D. Yuan, and M. Stumm, Non-intrusive performance profiling for entire software stacks based on the flow reconstruction principle, Proceedings of the conference on Operating Systems Design and Implementation, OSDI'16, pp.603-618, 2016.

K. Nagaraj, C. Killian, and J. Neville, Structured comparative analysis of systems logs to diagnose performance problems, Proceedings of the conference on Networked Systems Design and Implementation, NSDI'12, pp.26-26, 2012.

N. R. Tallent, J. M. Mellor-Crummey, and A. Porterfield, Analyzing lock contention in multithreaded applications, Proceedings of the symposium on Principles and Practice of Parallel Programming, PPoPP'10, pp.269-280, 2010.

X. Yu, S. Han, D. Zhang, and T. Xie, Comprehending performance from real-world execution traces, Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, ASPLOS '14, pp.193-206, 2014.
DOI : 10.1145/2541940.2541968

F. David, G. Thomas, J. Lawall, and G. Muller, Continuously measuring critical section pressure with the free-lunch profiler, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'14, pp.291-307, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01080277

E. Altman, M. Arnold, S. Fink, and N. Mitchell, Performance analysis of idle programs, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'10, pp.739-753, 2010.

A. Jaleel, R. S. Cohn, C.-K. Luk, and B. Jacob, CMP$im: A Pin-based on-the-fly multi-core cache simulator, Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, pp.28-36, 2008.

S. M. Günther and J. Weidendorfer, Assessing cache false sharing effects by dynamic binary instrumentation, Proceedings of the Workshop on Binary Instrumentation and Applications, pp.26-33, 2009.

Q. Zhao, D. Koh, S. Raza, D. Bruening, W. Wong et al., Dynamic cache contention detection in multi-threaded applications, Proceedings of the international conference on Virtual Execution Environments, pp.27-38, 2011.
DOI : 10.1145/1952682.1952688
URL : http://www.cag.lcs.mit.edu/commit/papers/2011/zhao-vee11-cache-contention.pdf

K. Du Bois, S. Eyerman, J. B. Sartor, and L. Eeckhout, Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior, Proceedings of the International Symposium on Computer Architecture, ISCA'13, pp.511-522, 2013.

M. Hobbel, T. Rauber, and C. Scholtes, Trace-based automatic padding for locality improvement with correlative data visualization interface, Proceedings of the International Conference on Parallel Architectures and Compilation, PACT'07, 2007.
DOI : 10.1109/ipdps.2008.4536472

T. Liu, C. Tian, Z. Hu, and E. D. Berger, PREDATOR: Predictive false sharing detection, Proceedings of the symposium on Principles and Practice of Parallel Programming, PPoPP'14, pp.3-14, 2014.

T. Liu and X. Liu, Cheetah: detecting false sharing efficiently and effectively, Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, pp.1-11, 2016.
DOI : 10.1145/2854038.2854039
URL : http://dl.acm.org/ft_gateway.cfm?id=2854039&type=pdf

A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs, There goes the neighborhood, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pp.1-12, 2013.
DOI : 10.1145/2503210.2503247

M. Casas and G. Bronevetsky, Active measurement of the impact of network switch utilization on application performance, Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS'14, pp.165-174, 2014.

Perf: Linux profiling with performance counters

L. Song and S. Lu, Statistical debugging for real-world performance problems, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'14, pp.561-578, 2014.
DOI : 10.1145/2714064.2660234
URL : https://minds.wisconsin.edu/bitstream/handle/1793/68592/TR1803-1.pdf?sequence=3

S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, The SPLASH-2 programs: Characterization and methodological considerations, Proceedings of the International Symposium on Computer Architecture, ISCA'95, pp.24-36, 1995.

C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, Evaluating MapReduce for Multi-core and Multiprocessor Systems, 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pp.13-24, 2007.
DOI : 10.1109/HPCA.2007.346181
URL : http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf

C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PARSEC benchmark suite, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pp.72-81, 2008.
DOI : 10.1145/1454115.1454128

B. Fitzpatrick, Distributed caching with memcached, Linux Journal, p.5, 2004.

S. Ghemawat and J. Dean, LevelDB, 2011.
URL : http://leveldb.org

F. Broquedis, J. Clet-ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.180-186, 2010.
DOI : 10.1109/PDP.2010.67
URL : https://hal.archives-ouvertes.fr/inria-00429889

C. Pousa Ribeiro, J.-F. Méhaut, A. Carissimi, M. Castro, and L. G. Fernandes, Memory affinity for hierarchical shared memory multiprocessors, 21st International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD'09, pp.59-66, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00788914

S. Blagodurov, S. Zhuravlev, A. Fedorova, and A. Kamali, A case for NUMA-aware contention management on multicore systems, Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT '10, pp.557-558, 2010.
DOI : 10.1145/1854273.1854350

W. J. Bolosky and M. L. Scott, False sharing and its effect on shared memory performance, Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS), p.57, 1993.

F. Broquedis, F. Diakhaté, S. Thibault, O. Aumage, R. Namyst et al., Scheduling Dynamic OpenMP Applications over Multicore Architectures, OpenMP in a New Era of Parallelism, pp.170-180, 2008.
DOI : 10.1007/978-3-540-79561-2_15
URL : https://hal.archives-ouvertes.fr/inria-00329934

M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize et al., Traffic management: A holistic approach to memory placement on NUMA systems, Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'13, pp.381-394, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00945758

A. Nisar, W. Liao, and A. Choudhary, Scaling parallel I/O performance through I/O delegate and caching system, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, p.9, 2008.
DOI : 10.1109/SC.2008.5214358
URL : http://users.eecs.northwestern.edu/~choudhar/Publications/ScalingParallelIOPerformanceThroughIODelgateAndCachingSystem.pdf

D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron, Everest: Scaling down peak loads through I/O off-loading, OSDI, pp.15-28, 2008.

J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich et al., The case for RAMClouds, ACM SIGOPS Operating Systems Review, vol.43, issue.4, pp.92-105, 2010.
DOI : 10.1145/1713254.1713276

M. Frigo and S. G. Johnson, The Design and Implementation of FFTW3, Proceedings of the IEEE, pp.216-231, 2005.
DOI : 10.1109/JPROC.2004.840301

J. Cetnar, W. Gudowski, and J. Wallenius, MCB: A continuous energy Monte Carlo burnup simulation code, Actinide and Fission Product Partitioning and Transmutation, 1999.

R. D. Falgout and U. M. Yang, hypre: A library of high performance preconditioners, International Conference on Computational Science, pp.632-641, 2002.

C. Bernard, T. Burch, T. A. DeGrand, C. DeTar, S. Gottlieb et al., Scaling tests of the improved Kogut-Susskind quark action, Physical Review D, vol.61, issue.11, p.111502, 2000.
DOI : 10.1103/PhysRevD.61.111502

Y. Wang, G. M. Stocks, W. A. Shelton, D. M. C. Nicholson, Z. Szotek et al., Order-N multiple scattering approach to electronic structure calculations, Physical Review Letters, vol.75, p.2867, 1995.
DOI : 10.1103/physrevlett.75.2867

J. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications, Proceedings of the Usenix Annual Technical Conference, USENIX ATC'12, pp.65-76, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00779908

J. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, Fast and portable locking for multicore architectures, ACM Transactions on Computer Systems (TOCS), vol.33, issue.4, pp.13:1-13:62, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01252167

M. Herlihy and J. E. B. Moss, Transactional memory: Architectural support for lock-free data structures, Proceedings of the International Symposium on Computer Architecture, ISCA'93, 1993.
DOI : 10.1109/isca.1993.698569
URL : http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-895-theory-of-parallel-systems-sma-5509-fall-2003/readings/herlihy_mo93.pdf

J. Tao and W. Karl, CacheIn: A Toolset for Comprehensive Cache Inspection, Proceedings of the International Conference on Computational Science, ICCS'05, pp.174-181, 2005.
DOI : 10.1007/11428848_22

T. Liu and E. D. Berger, SHERIFF: Precise detection and automatic mitigation of false sharing, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, pp.3-18, 2011.

M. Nanavati, M. Spear, N. Taylor, S. Rajagopalan, D. T. Meyer et al., Whose cache line is it anyway?, Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pp.141-154, 2013.
DOI : 10.1145/2465351.2465366

J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra et al., Experiences and lessons learned with a portable interface to hardware performance counters, Proceedings International Parallel and Distributed Processing Symposium, pp.289-291, 2003.
DOI : 10.1109/IPDPS.2003.1213517

V. Weaver, S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe et al., Detection of false sharing using machine learning, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp.1-9, 2013.

A. Pesterev, N. Zeldovich, and R. T. Morris, Locating cache performance bottlenecks using data profiling, Proceedings of the 5th European conference on Computer systems, EuroSys '10, pp.335-348, 2010.
DOI : 10.1145/1755913.1755947
URL : http://pdos.csail.mit.edu/papers/dprof:eurosys10.pdf

J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa, Contention aware execution, Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO '10, pp.257-265, 2010.
DOI : 10.1145/1772954.1772991

C. Xu, X. Chen, R. Dick, and Z. Mao, Cache contention and application performance prediction for multi-core systems, Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS'10, pp.76-86, 2010.

B. Teabe, A. Tchana, and D. Hagimont, Application-specific quantum for multi-core platform scheduler, Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pp.1-3, 2016.
DOI : 10.1109/PCCC.2012.6407650
URL : https://hal.archives-ouvertes.fr/hal-01782587

G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica et al., Scarlett, Proceedings of the sixth conference on Computer systems, EuroSys '11, pp.287-300, 2011.
DOI : 10.1145/1966445.1966472

Y. Oh, J. Choi, D. Lee, and S. H. Noh, Caching less for better performance: balancing cache size and update cost of flash memory cache in hybrid storage systems, FAST, 2012.

R. Lachaize, B. Lepers, and V. Quéma, MemProf: A memory profiler for NUMA multicore systems, Proceedings of the Usenix Annual Technical Conference, USENIX ATC'12, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00945731

N. Joukov, A. Traeger, R. Iyer, C. P. Wright, and E. Zadok, Operating system profiling via latency analysis, Proceedings of the conference on Operating Systems Design and Implementation, OSDI'06, pp.89-102, 2006.
DOI : 10.1145/1095810.1118607

S. Eyerman, K. D. Bois, and L. Eeckhout, Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications, 2012 IEEE International Symposium on Performance Analysis of Systems & Software, pp.145-155, 2012.
DOI : 10.1109/ISPASS.2012.6189221
URL : http://users.elis.ugent.be/~leeckhou/papers/ispass12_2.pdf

F. Trahay, Y. Ishikawa, F. Rue, R. Namyst, M. Faverge et al., EZTrace: A Generic Framework for Performance Analysis, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.618-619, 2011.
DOI : 10.1109/CCGrid.2011.83
URL : https://hal.archives-ouvertes.fr/inria-00587216

K. Coulomb, M. Faverge, J. Jazeix, O. Lagrasse, J. Marcoueille et al., Visual trace explorer (ViTE), 2009.

H. Casanova, A. Legrand, and M. Quinson, SimGrid: A Generic Framework for Large-Scale Distributed Experiments, Tenth International Conference on Computer Modeling and Simulation (uksim 2008), pp.126-131, 2008.
DOI : 10.1109/UKSIM.2008.28
URL : https://hal.archives-ouvertes.fr/inria-00260697

G. Southern and J. Renau, Analysis of PARSEC workload scalability, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.133-142, 2016.
DOI : 10.1109/ISPASS.2016.7482081

M. Roth, M. J. Best, C. Mustard, and A. Fedorova, Deconstructing the overhead in parallel applications, 2012 IEEE International Symposium on Workload Characterization (IISWC), pp.59-68, 2012.
DOI : 10.1109/IISWC.2012.6402901

M. A. Frumkin and L. V. Shabanov, Benchmarking memory performance with the data cube operator, 2004.

M. Zhuang and B. Aker, memaslap: Load testing and benchmarking a server

G. Voron, G. Thomas, V. Quéma, and P. Sens, An interface to implement NUMA policies in the Xen hypervisor, Proceedings of the EuroSys European Conference on Computer Systems, EuroSys'17, p.14, 2017.