G. E. Moore, Cramming More Components Onto Integrated Circuits, Proceedings of the IEEE, pp.82-85, 1998.
DOI : 10.1109/JPROC.1998.658762

R. R. Schaller, Moore's law: past, present and future, IEEE Spectrum, vol.34, issue.6, pp.52-59, 1997.
DOI : 10.1109/6.591665

TOP500 Project, Top #1 systems, 2015.

Intel, 2.2: Intel microarchitecture code name Sandy Bridge, Intel 64 and IA-32 Architectures Optimization Reference Manual, 2005.

P. Taylor, Baytrail uncore performance monitoring events. Available: https://software.intel.com/en-us/articles/baytrail-uncore-performance-monitoring-events, 2014.

S. R. Shenoy and A. Daniel, Intel architecture and silicon cadence: The catalyst for industry innovation, p.132, 2006.

H. Wong, Intel Ivy Bridge cache replacement policy. Available: http://blog.stuffedcow.net, 2013.

S. Raikin, D. J. Sager, Z. Sperber, E. Krimer, O. Lempel et al., Tracking mechanism coupled to retirement in reorder buffer for indicating sharing logical registers of physical register in record indexed by logical register, p.914617, 2014.

B. Kuttana, Technology insight: Intel Silvermont. Available: https://software.intel.com/sites, p.13, 2013.

E. Oseret, CQA: A code quality analyzer tool at binary level, pp.62-88

MAQAO, MAQAO project, p.47

J. Muir, Using the rdtsc instruction for performance monitoring, Intel Corporation, Tech. Rep, vol.17, p.29, 1997.

B. Sprunt, The basics of performance-monitoring hardware, IEEE Micro, vol.22, issue.4, pp.64-71, 2002.
DOI : 10.1109/MM.2002.1028477

D. Zaparanuks, M. Jovic, and M. Hauswirth, Accuracy of performance counter measurements, Performance Analysis of Systems and Software, ISPASS 2009. IEEE International Symposium on. IEEE, pp.23-32, 2009.

Intel, Profile function or loop execution time, Intel C++ Compiler XE 13.1 User and Reference Guides. [Online]. Available: https://software.intel.com/sites

A. C. De-melo, The new linux perf tools, Slides from Linux Kongress, p.18, 2010.

A. S. Charif, Maqao performance analysis and optimization tool

Q. Wu and O. Mencer, Evaluating Sampling Based Hotspot Detection, Architecture of Computing Systems - ARCS 2009, pp.28-39, 2009.
DOI : 10.1007/3-540-46080-2_95

J. Levon and P. Elie, Oprofile: A system profiler for linux, p.58, 2013.

J. Dongarra, K. London, S. Moore, P. Mucci, and D. Terpstra, Using papi for hardware performance monitoring on linux systems, Proc. Conf. on Linux Clusters, pp.25-27, 2001.

J. Treibig, G. Hager, and G. Wellein, LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments, 2010 39th International Conference on Parallel Processing Workshops, pp.207-216, 2010.
DOI : 10.1109/ICPPW.2010.38

A. Yasin, A Top-Down method for performance analysis and counters architecture, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.35-44, 2014.
DOI : 10.1109/ISPASS.2014.6844459

P. Calafiura, S. Eranian, D. Levinthal, S. Kama, and R. A. Vitillo, GOoDA: The Generic Optimization Data Analyzer, Journal of Physics: Conference Series, pp.52072-52090, 2012.
DOI : 10.1088/1742-6596/396/5/052072

S. Koliaï, Z. Bendifallah, M. Tribalat, C. Valensi, J. Acquaviva et al., Quantifying performance bottleneck cost through differential analysis, Proceedings of the 27th international ACM conference on International conference on supercomputing, ICS '13, pp.263-272, 2013.
DOI : 10.1145/2464996.2465440

C. Valensi, Madras: Multi-architecture binary rewriting tool technical report, pp.2013-2031

T. Austin, E. Larson, and D. Ernst, SimpleScalar: an infrastructure for computer system modeling, Computer, vol.35, issue.2, pp.59-67, 2002.
DOI : 10.1109/2.982917

V. S. Pai, P. Ranganathan, and S. V. Adve, Rsim: An execution-driven simulator for ilp-based shared-memory multiprocessors and uniprocessors, Proceedings of the Third Workshop on Computer Architecture Education, p.19, 1997.

B. Cmelik and D. Keppel, Shade: A fast instruction-set simulator for execution profiling, p.19, 1995.

P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg et al., Simics: A full system simulation platform, Computer, vol.35, issue.2, pp.50-58, 2002.
DOI : 10.1109/2.982916

A. Patel, F. Afram, S. Chen, and K. Ghose, MARSS, Proceedings of the 48th Design Automation Conference on, DAC '11, pp.1050-1055, 2011.
DOI : 10.1145/2024724.2024954

P. Bohrer, J. Peterson, M. Elnozahy, R. Rajamony, A. Gheith et al., Mambo, ACM SIGMETRICS Performance Evaluation Review, vol.31, issue.4, pp.8-12, 2004.
DOI : 10.1145/1054907.1054910

M. T. Yourst, PTLsim: A cycle accurate full system x86-64 microarchitectural simulator, Performance Analysis of Systems & Software, ISPASS 2007. IEEE International Symposium on. IEEE, pp.23-34, 2007.

G. H. Loh, S. Subramaniam, and Y. Xie, Zesto: A cycle-level simulator for highly detailed microarchitecture exploration, Performance Analysis of Systems and Software, ISPASS 2009. IEEE International Symposium on, pp.53-64, 2009.

E. Perelman, G. Hamerly, M. Van-biesbrouck, T. Sherwood, and B. Calder, Using SimPoint for accurate and efficient simulation, ACM SIGMETRICS Performance Evaluation Review, vol.31, issue.1, pp.318-319, 2003.
DOI : 10.1145/885651.781076

T. E. Carlson, W. Heirman, and L. Eeckhout, Sampled simulation of multithreaded applications, Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, pp.2-12, 2013.

Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook et al., RAMP gold, Proceedings of the 47th Design Automation Conference on, DAC '10, pp.463-468, 2010.
DOI : 10.1145/1837274.1837390

D. Genbrugge, S. Eyerman, and L. Eeckhout, Interval simulation: Raising the level of abstraction in architectural simulation, HPCA-16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture, pp.1-12, 2010.
DOI : 10.1109/HPCA.2010.5416636

A. Jaleel, R. S. Cohn, C. Luk, and B. Jacob, CMP$im: A pin-based on-the-fly multi-core cache simulator, Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, pp.28-36, 2008.

S. Zuckerman, J. Suetterlein, R. Knauerhase, and G. R. Gao, Using a "codelet" program execution model for exascale machines, Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, EXADAPT '11, pp.64-69, 2011.
DOI : 10.1145/2000417.2000424

M. C. Easton and R. Fagin, Cold-start vs. warm-start miss ratios, Communications of the ACM, vol.21, issue.10, pp.866-872, 1978.
DOI : 10.1145/359619.359634

J. Mccalpin, Stream benchmark, p.22, 1995.

L. W. Mcvoy and C. Staelin, lmbench: Portable tools for performance analysis, USENIX annual technical conference, pp.279-294, 1996.

D. Callahan, J. Dongarra, and D. Levine, Vectorizing compilers: a test suite and results, Proceedings. SUPERCOMPUTING '88, pp.98-105, 1988.
DOI : 10.1109/SUPERC.1988.44642

S. Maleki, An Evaluation of Vectorizing Compilers, 2011 International Conference on Parallel Architectures and Compilation Techniques, pp.372-382, 2011.
DOI : 10.1109/PACT.2011.68

F. Halimi and W. Jalby, MicroTools: Automating program generation and performance measurement, ICPPW 2012. IEEE, pp.424-433, 2012.

H. Wong, Measuring reorder buffer capacity. Available: http://blog.stuffedcow.net, pp.111-140, 2013.

C. Akel, Y. Kashnikov, P. De-oliveira-castro, and W. Jalby, Is source-code isolation viable for performance characterization?, Parallel Processing (ICPP), 2013 42nd International Conference on, pp.977-984, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00952290

Y. Lee and M. Hall, A Code Isolator: Isolating Code Fragments from Large Programs, Languages and Compilers for High Performance Computing, pp.164-178, 2005.
DOI : 10.1007/11532378_13

E. Petit, G. Papaure, F. Dru, and F. Bodin, Poster reception---ASTEX, Proceedings of the 2006 ACM/IEEE conference on Supercomputing, SC '06, p.27, 2006.
DOI : 10.1145/1188455.1188602

T. Sherwood, E. Perelman, and B. Calder, Basic block distribution analysis to find periodic behavior and simulation points in applications, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques, pp.3-14, 2001.
DOI : 10.1109/PACT.2001.953283

P. D. Castro, C. Akel, E. Petit, M. Popov, and W. Jalby, CERE, ACM Transactions on Architecture and Code Optimization, vol.12, issue.1, pp.6-23, 2015.
DOI : 10.1145/2724717
URL : https://hal.archives-ouvertes.fr/hal-01417214

M. Popov, C. Akel, F. Conti, W. Jalby, and P. D. Castro, PCERE: Fine-Grained Parallel Benchmark Decomposition for Scalability Prediction, 2015 IEEE International Parallel and Distributed Processing Symposium, pp.1151-1160, 2015.
DOI : 10.1109/IPDPS.2015.19
URL : https://hal.archives-ouvertes.fr/hal-01417304

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical recipes: The art of scientific computing, pp.90-93, 1992.

J. Noudohouenou, V. Palomares, W. Jalby, D. C. Wong, D. J. Kuck et al., Simsys, Proceedings of the 2013 Workshop on Rapid Simulation and Performance Evaluation Methods and Tools, RAPIDO '13, pp.28-61, 2013.
DOI : 10.1145/2432516.2432517

J. Noudohouenou, Performance prediction based on codelet driven application characterization, Versailles-Saint-Quentin-en-Yvelines, p.28, 2013.

M. H. Jamal and A. Waheed, Precise measurement of execution time of concurrent, symmetric, and short tasks, Int. CMG Conference, pp.149-160, 2008.

R. Dementiev, Monitoring integrated memory controller requests in the 2nd, 3rd and 4th generation Intel Core processors, https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel, p.72

Intel, 1.5: Uncore PMU summary tables, Intel Xeon Processor E5 v2 and E7 v2 Product Families Uncore Performance Monitoring Reference Manual, 1930.

S. Laha, J. H. Patel, and R. K. Iyer, Accurate low-cost methods for performance evaluation of cache memory systems, IEEE Transactions on Computers, vol.37, issue.11, pp.1325-1336, 1988.
DOI : 10.1109/12.8699

R. Jain, Techniques for experimental design, measurement, simulation, and modeling, p.31, 1991.

D. J. Lilja, Measuring computer performance: a practitioner's guide, p.31, 2000.
DOI : 10.1017/CBO9780511612398

S. Touati, Towards a statistical methodology to evaluate program speedups and their optimisation techniques, p.31, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00356529

C. Hsu and W. Feng, A power-aware run-time system for high-performance computing, Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, ser. SC '05, p.35, 2005.

K. Livingston, N. Triquenaux, T. Fighiera, J. C. Beyler, and W. Jalby, Computer using too much power? Give it a REST (Runtime Energy Saving Technology), Computer Science - Research and Development, vol.36, issue.2, pp.123-130, 2014.
DOI : 10.1007/s00450-012-0226-0

M. Horowitz, T. Indermaur, and R. Gonzalez, Low-power digital design, Proceedings of 1994 IEEE Symposium on Low Power Electronics, pp.8-11, 1994.
DOI : 10.1109/LPE.1994.573184

T. Mudge, Power: A First Class Design Constraint for Future Architectures, High Performance Computing - HiPC, pp.215-224, 2000.
DOI : 10.1007/3-540-44467-X_20

D. Brodowski, Linux kernel cpufreq subsystem

F. Talbart, Codelet tuning infrastructure, p.41, 2015.

Z. Bendifallah, W. Jalby, J. Noudohouenou, E. Oseret, V. Palomares et al., PAMDA: Performance Assessment Using MAQAO Toolset and Differential Analysis, Tools for High Performance Computing 2013, pp.107-127, 2014.
DOI : 10.1007/978-3-319-08144-1_9

S. S. Shende and A. D. Malony, The Tau Parallel Performance System, International Journal of High Performance Computing Applications, vol.20, issue.2, pp.287-311, 2006.
DOI : 10.1177/1094342006064482

M. Burtscher, B. Kim, J. R. Diamond, J. D. Mccalpin, L. Koesterke et al., PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.
DOI : 10.1109/SC.2010.41

Acumem, Acumem ThreadSpotter

. Lich, The scalasca performance toolset architecture, STHEC, p.58, 2008.

W. E. Nagel, A. Arnold, M. Weber, H. Hoppe, and K. Solchenbach, Vampir: Visualization and analysis of mpi resources, pp.69-80, 1996.

D. Barthou, A. C. Rubial, W. Jalby, S. Koliai, and C. Valensi, Performance Tuning of x86 OpenMP Codes with MAQAO, Parallel Tools Workshop, p.47, 2009.
DOI : 10.1007/978-3-642-11261-4_7

S. Koliai, S. Zuckerman, E. Oseret, M. Ivascot, T. Moseley et al., A Balanced Approach to Application Performance Tuning, LCPC, pp.111-125, 2009.
DOI : 10.1007/978-3-642-13374-9_8

A. S. Charif, On code performance analysis and optimisation for multicore architectures, 2012.

F. Real, M. Trumm, V. Vallet, B. Schimmelpfennig, M. Masella et al., Quantum Chemical and Molecular Dynamics Study of the Coordination of Th(IV) in Aqueous Solvent, The Journal of Physical Chemistry B, vol.114, issue.48, pp.15-913, 2010.
DOI : 10.1021/jp108061s
URL : https://hal.archives-ouvertes.fr/hal-00567307

C. Staelin and H. Packard-laboratories, lmbench: Portable tools for performance analysis, USENIX Annual Technical Conference, pp.279-294, 1996.

J. Liu, W. Yu, J. Wu, D. Buntinas, S. Kini et al., Microbenchmark performance comparison of high-speed cluster interconnects, IEEE Micro, p.48, 2004.

S. R. Alam, R. F. Barrett, J. A. Kuehn, P. C. Roth, and J. S. Vetter, Characterization of Scientific Workloads on Systems with Multi-Core Processors, 2006 IEEE International Symposium on Workload Characterization, pp.225-236, 2006.
DOI : 10.1109/IISWC.2006.302747

E. Baysal, D. Kosloff, and J. Sherwood, Reverse time migration, Geophysics, 1983.

Gprof, The GNU profiler

M. Martonosi, A. Gupta, and T. Anderson, Memspy: Analyzing memory system bottlenecks in programs, Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, pp.1-12, 1992.

L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin et al., HPCToolkit: tools for performance analysis of optimized parallel programs, http://hpctoolkit.org, Concurr. Comput.: Pract. Exper., vol.22, issue.6, pp.685-701, 2010.

O. Sopeju, M. Burtscher, A. Rane, and J. Browne, Autoscope: Automatic suggestions for code optimizations using perfexpert, pp.19-25, 2011.

W. Yoo, K. Larson, L. Baugh, S. Kim, and R. H. Campbell, Adp: automated diagnosis of performance pathologies using hardware events, pp.283-294, 2012.

W. Yoo, K. Larson, S. Kim, W. Ahn, R. H. Campbell et al., Automated fingerprinting of performance pathologies using performance monitoring units (PMUs), 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar '11), Berkeley, CA: USENIX.

Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual, pp.248966-248996, 2014.

A. Fog, Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for intel, amd and via cpus, p.107, 2015.

Intel, 2.3.3: Execution core (operations with data-dependent latencies), Intel 64 and IA-32 Architectures Optimization Reference Manual, p.65, 2014.

A. Arcangeli, Transparent hugepage support, Aug. 2010.

M. B., R. Borkar, and S. Jourdan, Advancing Moore's law on 2014 - Broadwell converged core. Available: http://www.intel.com/content/dam, 2014.

T. S. Karkhanis and J. E. Smith, A first-order superscalar processor model, Proceedings of the 31st Annual International Symposium on Computer Architecture, p.338, 2004.

S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, A mechanistic performance model for superscalar out-of-order processors, ACM Transactions on Computer Systems, vol.27, issue.2, pp.1-3, 2009.
DOI : 10.1145/1534909.1534910

J. Treibig and G. Hager, Introducing a performance model for bandwidth-limited loop kernels, Parallel Processing and Applied Mathematics, pp.615-624, 2010.

S. Williams, A. Waterman, and D. Patterson, Roofline, Communications of the ACM, vol.52, issue.4, pp.65-76, 2009.
DOI : 10.1145/1498765.1498785

P. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, Construction and Use of Linear Regression Models for Processor Performance Analysis, The Twelfth International Symposium on High-Performance Computer Architecture, 2006, pp.99-108, 2006.
DOI : 10.1109/HPCA.2006.1598116

J. J. Yi, D. J. Lilja, and D. M. Hawkins, A statistically rigorous approach for improving simulation methodology, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pp.281-291, 2003.
DOI : 10.1109/HPCA.2003.1183546

D. C. Wong, V. Palomares, E. Oseret, Z. Bendifallah, M. Tribalat et al., VP3: A vectorization potential performance prototype, pp.15-131, 2015.

J. M. Cebrián, L. Natvig, and J. C. Meyer, Performance and energy impact of parallelization and vectorization techniques in modern microprocessors, Computing, vol.2, issue.4, p.79, 2013.
DOI : 10.1007/s00607-013-0366-5

V. Moureau, From Large-Eddy Simulation to Direct Numerical Simulation of a lean premixed swirl flame: Filtered laminar flame-PDF modeling, Combustion and Flame, vol.158, issue.7, p.123, 2011.
DOI : 10.1016/j.combustflame.2010.12.004

G. C. Evans, Vector seeker, Proceedings of the 2014 Workshop on Workshop on programming models for SIMD/Vector processing, WPMVP '14, pp.14-92, 2014.
DOI : 10.1145/2568058.2568069

J. Holewinski, Dynamic trace-based analysis of vectorization potential of applications, SIGPLAN Not, p.92, 2012.

A. Rane, R. Krishnaiyer, C. J. Newburn, J. Browne, L. Fialho et al., Unification of Static and Dynamic Analyses to Enable Vectorization, pp.367-381, 2014.
DOI : 10.1007/978-3-319-17473-0_24

C. Haine, O. Aumage, P. Enguerrand, and D. Barthou, Exploring and evaluating array layout restructuration for SIMDization, pp.2014-92
URL : https://hal.archives-ouvertes.fr/hal-01070467

D. H. Bailey, Little's law and high performance computing, RNR Technical Report. Citeseer, p.96, 1997.

R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, Dynamically allocating processor resources between nearby and distant ILP, Computer Architecture Proceedings. 28th Annual International Symposium on. IEEE, pp.26-37, 2001.

O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, Runahead execution: an alternative to very large instruction windows for out-of-order processors, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pp.129-140, 2003.
DOI : 10.1109/HPCA.2003.1183532

K. Skadron, P. S. Ahuja, M. Martonosi, and D. W. Clark, Branch prediction, instruction-window size, and cache size: performance trade-offs and simulation techniques, IEEE Transactions on Computers, vol.48, issue.11, pp.1260-1281, 1999.
DOI : 10.1109/12.811115

J. S. Griffith, S. R. Gupta, and G. J. Hinton, Method and apparatus for binding instructions to dispatch ports of a reservation station, pp.674-102, 1997.

B. Sutanto, S. T. Srinivasan, M. C. Merten, C. Y. Lai, A. J. Christiansen et al., Method and apparatus for implementing dynamic port-binding within a reservation station, pp.7188-103, 2015.

L. Djoudi, D. Barthou, O. Tomaz, A. Charif-rubial, J. Acquaviva et al., The design and architecture of maqao profile: an instrumentation maqao module, Sixth Workshop on Explicitly Parallel Instruction Computing Architectures and Compiler Technology conjunction with the IEEE/ACM International Symposium on Code Generation and Optimization, pp.13-107, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00150672

Intel, Legacy decode pipeline - macro-fusion, Intel 64 and IA-32 Architectures Optimization Reference Manual, 0111.

C. Hewett, Back end memory bound. [Online]. Available: https://software.intel.com/en-us/forums, p.151, 2012.

Intel, 2.2.4: The execution core, Intel 64 and IA-32 Architectures Optimization Reference Manual, 0115.

G. Paoloni, How to benchmark code execution times on intel ia-32 and ia-64 instruction set architectures, Intel Corporation, 0123.

D. Burger and T. M. Austin, The SimpleScalar tool set, version 2.0, ACM SIGARCH Computer Architecture News, vol.25, issue.3, pp.13-25, 1997.
DOI : 10.1145/268806.268810

N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi et al., The gem5 simulator, ACM SIGARCH Computer Architecture News, vol.39, issue.2, pp.1-7, 2011.
DOI : 10.1145/2024716.2024718

T. E. Carlson, W. Heirman, and L. Eeckhout, Sniper, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.52-128, 2011.
DOI : 10.1145/2063384.2063454

W. Heirman, T. Carlson, and L. Eeckhout, Sniper: scalable and accurate parallel multi-core simulation, 8th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES-2012). High-Performance and Embedded Architecture and Compilation Network of Excellence (HiPEAC), 2012, pp.91-94