Oneview_config_file: the path to the ONE View configuration file containing the experiment parameters. The ONE View file can be generated using the following command

"Cycles L1 if clean" : Loop throughput (number of cycles per iteration) if all data are in L1 and scalar integer instructions are removed

"Cycles L1 if fully vectorized" : Loop throughput (number of cycles per iteration) if all data are in L1 and the loop is fully vectorized

"Vector-efficiency ratio all" : Vector efficiency ratio (average proportion of the vector length actually used) of instructions processing FP or integer elements

"Vectorization ratio all" : Vectorization ratio (proportion of vectorizable instructions that were vectorized) of instructions processing FP or integer elements

"FP op per cycle L1" : FLOPs per cycle if all data are in L1

"Speedup if clean" : Speedup of instructions if clean (Cycles L1 / Cycles L1 if clean)

"Speedup if fully vectorized" : Speedup of instructions if fully vectorized (Cycles L1 / Cycles L1 if fully vectorized)
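The two speedup metrics above are plain ratios of cycle counts, which the following minimal Python sketch illustrates. The function name `speedup` and the numeric values are hypothetical and chosen only for illustration; they are not output from MAQAO or ASSIST.

```python
def speedup(cycles_l1: float, cycles_l1_variant: float) -> float:
    """Return Cycles L1 divided by the cycle count of a hypothetical variant."""
    if cycles_l1 <= 0 or cycles_l1_variant <= 0:
        raise ValueError("cycle counts must be positive")
    return cycles_l1 / cycles_l1_variant


# Hypothetical cycle counts per iteration (illustration only).
cycles_l1 = 12.0                      # "Cycles L1"
cycles_l1_if_clean = 8.0              # "Cycles L1 if clean"
cycles_l1_if_fully_vectorized = 3.0   # "Cycles L1 if fully vectorized"

speedup_if_clean = speedup(cycles_l1, cycles_l1_if_clean)
speedup_if_fully_vectorized = speedup(cycles_l1, cycles_l1_if_fully_vectorized)

print(f"Speedup if clean: {speedup_if_clean}")                        # 1.5
print(f"Speedup if fully vectorized: {speedup_if_fully_vectorized}")  # 4.0
```

A value close to 1 suggests that the corresponding transformation would bring little benefit for that loop, while a larger value marks the loop as a more promising candidate.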

Installation Requirements

Before using ASSIST, the following packages and libraries must be installed:

• openjdk-6-jdk: version 6 or higher (not tested with older versions)

• ROSE libraries and Boost libraries, available at git.maqao.org:S2S/LIBS.git

ASSIST can be downloaded from the git repository, as well as the required libraries and a pre-compiled binary:

• Binary (containing MAQAO, ASSIST and all libraries): git clone git@git, S2S/LIBS.git

• Sources: git clone git@git.maqao.org:S2S/S2S.git

• Required libraries: git clone git@git.maqao.org:S2S/LIBS.git

J. K. Hollingsworth and A. Tiwari, End-to-end Auto-tuning with Active Harmony, Performance Tuning of Scientific Applications. Chapman and Hall/CRC Computational Science Series, 2010.

L. Adhianto, HPCTOOLKIT: tools for performance analysis of optimized parallel programs, vol.22, pp.685-701

Advisor

D. Kim, Multi-level tiling: M for the price of one, ACM/IEEE conference on Supercomputing, p.51, 2007.

M. Amini, Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators, 2012.

AnandTech

L. O. Andersen, Program Analysis and Specialization for the C Programming Language, 1994.

APS

ATLAS

AVBP

D. Barthou, Performance Tuning of x86 OpenMP Codes with MAQAO, Parallel Tools Workshop, Dresden, pp.95-113, 2009.

C. Bastoul, Code generation in the polyhedral model is easier than you think, International Conference on Parallel Architectures and Compilation Techniques, pp.7-16, 2004.
URL : https://hal.archives-ouvertes.fr/hal-00017260

Z. Bendifallah, PAMDA: Performance Assessment Using MAQAO Toolset and Differential Analysis, Tools for High Performance Computing 2013: Proceedings of the 7th International Workshop on Parallel Tools for High Performance Computing, pp.107-127, 2014.

Z. Bendifallah, Generalization of the decremental performance analysis to differential analysis, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01293039

A. Di Biagio, [RFC] llvm-mca: a static performance analysis tool, 2018.

J. Bilmes, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, ACM International Conference on Supercomputing, pp.253-260, 2014.

U. Bondhugula, A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices. ACM, pp.101-113, 2008.

J. Brant and D. Roberts, The SmaCC Transformation Engine, OOPSLA '09, 2009.

M. Burtscher, PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications, pp.1-11, 2010.

A. S. Charif-Rubial, MIL: A language to build program analysis tools through static binary instrumentation, 20th Annual International Conference on High Performance Computing, pp.206-215, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00920875

A. S. Charif-Rubial, CQA: A code quality analyzer tool at binary level, pp.1-10, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01658710

C. Chen, J. Chame, and M. Hall, CHiLL: A Framework for Composing High-Level Loop Transformations, 2008.

D. Chen, D. X. Li, and T. Moseley, AutoFDO: Automatic Feedback-directed Optimization for Warehouse-scale Applications, Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO '16, 2016.

E. Chung, L. Benini, and G. De Micheli, Automatic Source Code Specialization for Energy Reduction, Proceedings of the 2001 International Symposium on Low Power Electronics and Design, ISLPED '01, Huntington Beach, California, 2001.

J. R. Cordy, Source Transformation, Analysis and Generation in TXL, Proceedings of the 2006 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation. PEPM '06, pp.1-11, 2006.

CORIA

C. Dave, Cetus: A source-to-source compiler infrastructure for Multicores, pp.36-42, 2009.

R. Dolbeau, S. Bihan, and F. Bodin, HMPP: A hybrid multi-core parallel programming environment, Workshop on general purpose processing on graphics processing units, vol.28, 2007.

S. Donadio, A Language for the Compact Representation of Multiple Program Versions, International Workshop on Languages and Compilers for Parallel Computing, pp.136-151, 2005.
URL : https://hal.archives-ouvertes.fr/hal-00141067

J. J. Dongarra, A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software, ACM, pp.1-17, 1990.

A. Fog,

A. Fog,

M. Frigo and S. G. Johnson, FFTW: An adaptive software architecture for the FFT, Proceedings of the ICASSP Conference, p.1381, 1998.

G. Ozen, E. Ayguadé, and J. Labarta, MACC: Mercurium ACCelerator Model, International Workshop on OpenMP, 2014.

M. Geimer, The SCALASCA Performance Toolset Architecture, International Workshop on Scalable Tools for High-End Computing (STHEC), pp.51-65, 2008.

S. Girbal, Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies, International Journal of Parallel Programming, pp.261-317, 2006.

GNU

X. Gonze, ABINIT: First-principles approach to material and nanosystem properties, Computer Physics Communications, pp.2582-2615, 2009.

Google

M. W. Hall, Maximizing Multiprocessor Performance with the SUIF Compiler, IEEE Computer, 1996.

J. Hammer, Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

A. Hartono, B. Norris, and P. Sadayappan, Annotation-based empirical performance tuning using Orio, 2009 IEEE International Symposium on Parallel Distributed Processing, pp.1-11, 2009.

S. Henry, H. Bollore, and E. Oseret, Towards the Generalization of Value Profiling for High-Performance Application Optimization

J. Rentzsch,

F. Irigoin, Interprocedural Analyses for Programming Environments, Workshop on Environments and Tools For Parallel Scientific Computing, 1992.

J. Knoop, O. Rüthing, and B. Steffen, Partial dead code elimination, Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, pp.147-158, 1994.

W. Jalby, The Long and Winding Road Toward Efficient High-Performance Computing, vol.106, pp.1985-2003, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02179181

K. Meng and B. Norris, Mira: A Framework for Static Performance Analysis, Cluster Computing (CLUSTER), 2017.

P. Klint, T. van der Storm, and J. Vinju, RASCAL: A Domain Specific Language for Source Code Analysis and Manipulation, IEEE International Working Conference on Source Code Analysis and Manipulation, pp.168-177, 2009.

A. Knüpfer, Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Tools for High Performance Computing, pp.79-91, 2011.

P. Kocher, Spectre Attacks: Exploiting Speculative Execution, 2018.

S. Koliaï, Quantifying Performance Bottleneck Cost Through Differential Analysis, Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS '13, pp.263-272, 2013.

S. Koliaï, Static and dynamic approach for performance evaluation of scientific codes, 2011.

O. Krzikalla, Scout: A Source-to-Source Transformator for SIMD-Optimizations, Euro-Par, pp.137-145, 2012.

L. Pérez Cáceres, Automatic configuration of GCC using irace, Artificial Evolution, pp.202-216, 2017.

S. Larsen, E. Witchel, and S. Amarasinghe, Techniques for Increasing and Detecting Memory Alignment, 2001.

J. Lee, H. Kim, and R. Vuduc, When Prefetching Works, When It Doesn't, and Why, vol.9, pp.1-29, 2012.

J.-B. Le Reste and A. S. Charif-Rubial,

S. Liao, Machine learning-based prefetch optimization for data center applications, Proceedings of the Conference on High Performance Computing Networking, pp.1-10, 2009.

M. Lipp, Meltdown, 2018.

G. Llort, On the usefulness of object tracking techniques in performance analysis, SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2013.

P. Lu, PIT: A Framework for Effectively Composing High-Level Loop Transformations, Computing and Informatics. Open Journal Systems, pp.943-963, 2012.

R. Bravenboer, K. T. Kalleberg, and E. Visser, Stratego/XT 0.17. A language and toolset for program transformation, Science of Computer Programming, 2008.

A. Mandal, Using Dynamic Compilation to Achieve Ninja Performance for CNN Training on Many-Core Processors, Euro-Par, 2018.

msr-tools

H. Munk, ACOTES Project: Advanced Compiler Technologies for Embedded Streaming, vol.39, pp.397-450, 2011.

R. Muth, S. Watterson, and S. Debray, Code Specialization based on Value Profiles, International Static Analysis Symposium, pp.340-359, 2000.

G. C. Necula, CIL: Intermediate language and tools for analysis and transformation of C programs, International Conference on Compiler Construction, pp.213-228, 2002.

D. Novillo, SamplePGO: The Power of Profile Guided Optimizations Without the Usability Burden, Proceedings of the 2014 LLVM Compiler Infrastructure in HPC. LLVM-HPC '14, pp.22-28, 2014.

OpenACC

OpenC++

OpenMP

GCC Instrumentation Options

GCC Optimization Options

PGO overview

J. N. Amaral and P. Berube, Aestimo: a feedback-directed optimization evaluation tool, 2006.

M. Palkowski and W. Bielecki, TRACO Parallelizing Compiler, Soft Computing in Computer and Information Science, pp.409-421, 2015.

V. Palomares, Combining static and dynamic approaches to model loop performance in HPC, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01293040

M. Panchenko, BOLT: A Practical Binary Optimizer for Data Centers and Beyond, 2018.

P. R. Panda, A data alignment technique for improving cache performance, Proceedings International Conference on Computer Design VLSI in Computers and Processors, pp.587-592, 1997.

ParaFormance

PerfExpert

D. Plotnikov, An Automatic tool for tuning compiler optimizations, Ninth International Conference on Computer Science and Information Technologies Revised Selected Papers, pp.1-7, 2013.

D. Plotnikov, Automatic Tuning of Compiler Optimizations and Analysis of their Impact, vol.18, pp.1312-1321, 2013.

M. Popov, Piecewise Holistic Autotuning of Parallel Programs with CERE, vol.29, p.4190, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01542912

A. Porterfield, Software Methods for Improvement of Cache Performance on Supercomputer Applications, 1989.

L. Pouchet, Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time, ACM SIGPLAN Notices. ACM, pp.90-100, 2008.
URL : https://hal.archives-ouvertes.fr/hal-01257273

LLVM PGO presentation

W. H. Press, Numerical Recipes 3rd Edition: The Art of Scientific Computing, 2007.

M. Puschel, SPIRAL: Code generation for DSP transforms, Proceedings of the IEEE, pp.216-231, 2005.

QMCPACK

D. Quinlan, ROSE: Compiler Support for Object-Oriented Frameworks, Parallel Processing Letters, pp.215-226, 2000.

R. C. Whaley, A. Petitet, and J. J. Dongarra, Automated empirical optimizations of software and the ATLAS project, Parallel Computing, pp.3-35, 2000.

G. Ren, Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers, pp.65-79, 2010.

T. Schonfeld and M. Rudgyard, Steady and Unsteady Flow Simulations Using the Hybrid Flow Solver AVBP, AIAA Journal, AIAA ARC, pp.1378-1385, 1999.

S. S. Shende and A. D. Malony, The TAU Parallel Performance System, vol.20, pp.287-311, 2006.

S. Srinath, Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers, IEEE 13th International Symposium on High Performance Computer Architecture, pp.63-74, 2007.

R. Suda, H. Takizawa, and S. Hirasawa, Xevtgen: Fortran code transformer generator for high performance scientific codes, International Journal of Networking and Computing, pp.263-289, 2016.

W. J. Tan, A Code Generation Framework for Targeting Optimized Library Calls for Multiple Platforms, IEEE Transactions on parallel and distributed systems, vol.26, 2014.

T. S. F. X. Teixeira, Locus: A System and a Language for Program Optimization, Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2019, pp.217-228, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02135657

A. Tiwari, A Scalable Auto-tuning Framework for Compiler Optimization, Parallel and Distributed Processing, pp.1-12, 2009.

MAQAO toolsuite

C. Valensi, A generic approach to the definition of low-level components for multi-architecture binary analysis, 2014.

S. Verdoolaege, Polyhedral Parallel Code Generation for CUDA, ACM Trans. Architec. Code Optim, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00786677

I. D. Baxter, C. Pidgeon, and M. Mehlich, DMS®: Program transformations for practical scalable software evolution, Proceedings of the 26th International Conference on Software Engineering, ICSE 2004, pp.625-634, 2004.

C. Lattner and V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation, Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, 2004.

VTune

R. Vuduc, J. Demmel, and K. Yelick, OSKI: A library of automatically tuned sparse matrix kernels, Journal of Physics: Conference Series, p.521, 2005.

WARP3D

APS website

C. Wu, An Overview of the Open Research Compiler, Languages and Compilers for High Performance Computing: 17th International Workshop, LCPC 2004, pp.17-31, 2004.

W. A. Wulf and S. A. McKee, Hitting the Memory Wall: Implications of the Obvious, vol.23, pp.20-24, 1995.

X. Xiao, An Approach to Customization of Compiler Directives for Application-Specific Code Transformations, 2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs, pp.99-106, 2014.

Q. Yi, POET: A Scripting Language For Applying Parameterized Source-to-source Program Transformations, Software: Practice and Experience, pp.675-706, 2012.