A. Christophe, F. Pierre, and B. Dominique, Parallel birth and death process for cell nuclei extraction in histopathology images, Parallel Processing (ICPP), 2013 42nd International Conference on, pp.429-438, 2013.

F. J. Aherne, N. A. Thacker, and P. I. Rockett, The Bhattacharyya metric as an absolute similarity measure for frequency coded data, vol.34, pp.363-368, 1998.

B. Cédric, Generating loops for scanning polyhedra. PRiSM, Versailles University, vol.23, 2002.

B. Cédric, Improving Data Locality in Static Control Programs, 2004.

B. Cédric, Extracting polyhedral representation from high level languages. Tech. rep. Related to the Clan tool, 2008.

B. Cédric, Openscop: A specification and a library for data exchange in polyhedral compilation tools, 2011.

B. Cédric, Contributions to high-level program optimization, 2012.

B. Cédric, H. David, E. Bailey, . Barszcz, T. John et al., Large-scale simulation of elastic wave propagation in heterogeneous media on parallel computers, Cloog: The chunky loop generator, vol.5, pp.85-102, 1991.

B. Uday, B. Vinayaka, and P. Irshad, Diamond tiling: Tiling techniques to maximize parallelism for stencil computations, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.5, pp.1285-1298, 2017.

H. J. C. Berendsen, Bio-molecular dynamics comes of age, Science, vol.271, issue.5251, pp.954-954, 1996.

B. Uday, H. Albert, R. Jagannathan, and S. Ponnuswamy, A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, vol.43, pp.101-113, 2008.

M. J. Berger and J. Oliger, Adaptive mesh refinement for hyperbolic partial differential equations, Journal of Computational Physics, vol.53, issue.3, pp.484-512, 1984.

B. Uday, Pluto-an automatic parallelizer and locality optimizer for multicores, 2009.

B. Uday, Compiling affine loop nests for distributed-memory parallel architectures, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.33, 2013.

R. Bleck, C. Rooth, D. Hu, and L. T. Smith, Salinity-driven thermocline transients in a wind- and thermohaline-forced isopycnic coordinate model of the North Atlantic, Journal of Physical Oceanography, vol.22, issue.12, pp.1486-1505, 1992.

T. A. Cortese and S. Balachandar, High performance spectral simulation of turbulent flows in massively parallel machines with distributed memory, The International Journal of Supercomputer Applications and High Performance Computing, vol.9, pp.187-204, 1995.

C. Chun, C. Jacqueline, and H. Mary, Chill: A framework for composing high-level loop transformations, 2008.

C. Pierre-nicolas and G. Jens, Experimenting iterative computations with ordered read-write locks, 18th Euromicro International Conference on Parallel, Distributed and network-based Processing, pp.155-162, 2010.

C. Pierre-nicolas and G. Jens, Iterative computations with ordered read-write locks, Journal of Parallel and Distributed Computing, vol.70, issue.5, pp.496-504, 2010.

C. Jason, H. Muhuan, and Z. Yi, Accelerating fluid registration algorithm on multi-FPGA platforms, Field Programmable Logic and Applications (FPL), 2011 International Conference on, pp.50-57, 2011.

J. Cohen and M. J. Molemaker, A fast double precision CFD code using CUDA, Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, pp.414-429, 2009.

M. Christen, O. Schenk, and H. Burkhart, Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures.

M. Christen, O. Schenk, E. Neufeld, M. Paulides, and H. Burkhart, Manycore stencil computations in hyperthermia applications, Scientific Computing with Multicore and Accelerators, pp.255-277, 2010.

C. Jason and Z. Yi, Lithographic aerial image simulation with FPGA-based hardware acceleration, Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays, pp.67-76, 2008.

C. C. Douglas, J. Hu, M. Kowarschik, U. Rüde, and C. Weiss, Cache optimization for structured and unstructured grid multigrid, Electronic Transactions on Numerical Analysis, vol.10, pp.21-40, 2000.

A. David, J. Hennessy, D. Kaushik, K. Shoaib, W. Samuel et al., Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, Computer organization and design: the hardware/software interface. San Mateo, CA: Morgan Kaufmann Publishers, 1:998, vol.51, p.122, 1999.

D. Roshan, R. Chandan, R. Thejas, and U. Bond-hugula, Generating efficient data movement code for heterogeneous architectures with distributed-memory, Parallel Architectures and Compilation Techniques (PACT), 2013 22nd International Conference on, pp.375-386, 2013.

C. Earl, M. Might, A. Bagusetty, and J. C. Sutherland, Nebo: An efficient, parallel, and portable domain-specific language for numerically solving partial differential equations, Journal of Systems and Software, vol.125, pp.389-400, 2017.

M. Frigo and S. G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE, vol.93, issue.2, pp.216-231, 2005.

M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, Cache-oblivious algorithms, Foundations of Computer Science, 1999. 40th Annual Symposium on, pp.285-297, 1999.

F. Matthew, R. Mike, R. John, S. Adam, and X. Yingqi, Manticore: A heterogeneous parallel language, Proceedings of the 2007 workshop on Declarative aspects of multicore programming, pp.37-44, 2007.

M. Frigo and V. Strumpen, Evaluation of cache-based superscalar and cacheless vector architectures for scientific computations, Proc. of the 19th ACM International Conference on Supercomputing (ICS05), 2005.

G. Georgios, A. Maria, K. Nectarios, G. Tobias, C. Albert et al., Split tiling for gpus: automatic parallelization using trapezoidal tiles, Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, vol.14, pp.24-31, 2003.

G. Jens, J. Emmanuel, and A. Michael, Relaxed synchronization with ordered read-write locks, Euro-Par 2011: Parallel Processing Workshops, vol.7155, pp.387-397, 2011.

G. Jens, J. Emmanuel, and M. Farouk, Fully-abstracted affinity optimization for task-based models

G. Jens, J. Emmanuel, and M. Farouk, Optimizing locality by topology-aware placement for a task based programming model, Cluster Computing (CLUSTER), 2016 IEEE International Conference on, pp.164-165, 2016.

P. Ghysels, P. Kłosiewicz, and W. Vanroose, Improving the arithmetic intensity of multigrid with the help of polynomial smoothers, Numerical Linear Algebra with Applications, vol.19, issue.2, pp.253-267, 2012.

G. Jens, V. Stéphane, and M. Patrick, Resource Centered Computing delivering high parallel performance, Heterogeneity in Computing Workshop (HCW 2014), workshop of 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014), 2014.

M. Hall, J. Chame, C. Chen, J. Shin, G. Rudy, and M. M. Khan, Loop transformation recipes for code generation and auto-tuning, International Workshop on Languages and Compilers for Parallel Computing, pp.50-64, 2009.

H. William, D. Andrew, and S. Klaus, Vmd: visual molecular dynamics, Journal of molecular graphics, vol.14, issue.1, pp.33-38, 1996.

H. Tom, H. Justin, V. Richard, F. Franz, P. Louis-noël et al., A domain-specific language and compiler for stencil computations on short-vector simd and gpu architectures

H. Justin, P. Louis-noël, S. Ponnuswamy, H. Tom, K. Stock et al., Data layout transformation for stencil computations on short-vector simd architectures, Proceedings of the 27th international ACM conference on International conference on supercomputing, vol.42, pp.1-12, 2007.

K. Kadau, P. S. Lomdahl, B. L. Holian, T. C. Germann, D. Kadau et al., Molecular-dynamics study of mechanical deformation in nano-crystalline aluminum, Metallurgical and Materials Transactions A, vol.35, pp.2719-2723, 2004.

K. Markus and W. Christian, Dimepack-a cache-optimized multigrid library, Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2001), volume I, Citeseer, 2001.

L. Christian, A. Sven, B. Matthias, G. Armin, H. Frank et al., Exastencils: advanced stencil-code engineering, European Conference on Parallel Processing, pp.553-564, 2014.

L. Xavier, D. Damien, G. Jacques, R. Didier, and V. Jérôme, The objective caml system release 3.11. Documentation and user's manual. INRIA, 2008.

L. Vincent, Polylib: A library for manipulating parameterized polyhedra, 1999.

Z. Li and Y. Song, Automatic tiling of iterative stencil loops, ACM Transactions on Programming Languages and Systems (TOPLAS), vol.26, issue.6, pp.975-1028, 2004.

P. M. Morse and H. Feshbach, Methods of theoretical physics, 1946.

P. Micikevicius, 3D finite difference computation on GPUs using CUDA, Proceedings of 2nd workshop on general purpose processing on graphics processing units, pp.79-84, 2009.

S. Mitra, S. C. Kothari, J. Cho, and A. Krishnaswamy, Paragent: A domain-specific semi-automatic parallelization tool, International Conference on High-Performance Computing, pp.141-148, 2000.

M. Naoya, N. Tatsuo, K. Sato, and M. Satoshi, Physis: an implicitly parallel programming model for stencil computations on large-scale gpu-accelerated supercomputers, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p.11, 2011.

M. Jiayuan and S. Kevin, A performance study for iterative stencil loops on gpus with ghost zone optimizations, International Journal of Parallel Programming, vol.39, issue.1, pp.115-142, 2011.

R. T. Mullapudi, V. Vasista, and U. Bondhugula, Polymage: Automatic optimization for image processing pipelines, ACM SIGARCH Computer Architecture News, vol.43, pp.429-443, 2015.

M. Hua and W. Chao-yang, Large-scale simulation of polymer electrolyte fuel cells by parallel computing, Chemical Engineering Science, vol.59, issue.16, pp.3331-3343, 2004.

N. Aiichiro, K. Rajiv, . Kalia, V. Priya, N. Anthony et al., Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers, High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, vol.83, pp.1-13, 1994.

NVIDIA, CUDA programming guide, 2010.

P. Zhelong and E. Rudolf, Fast and effective orchestration of compiler optimizations for automatic performance tuning, Proceedings of the International Symposium on Code Generation and Optimization, pp.319-332, 2006.

E. H. Phillips and M. Fatica, Implementing the Himeno benchmark with CUDA on GPU clusters, Parallel & Distributed Processing (IPDPS), pp.1-10, 2010.

P. Sander, P. Szilárd, S. Roland, L. Per, B. Pär et al., Gromacs 4.5: a high-throughput and highly parallel open source molecular simulation toolkit, Bioinformatics, vol.29, issue.7, pp.845-854, 2013.

A. D. Pereira, L. Ramos, and L. F. W. Góes, Pskel: A stencil programming framework for cpu-gpu systems, Concurrency and Computation: Practice and Experience, vol.27, issue.17, pp.4938-4953, 2015.

D. C. Rapaport, The art of molecular dynamics simulation, 2004.

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, ACM SIGPLAN Notices, vol.48, issue.6, pp.519-530, 2013.

L. Renganarayanan, D. Kim, S. Rajopadhye, and M. M. Strout, Parameterized tiled loops for free, ACM SIGPLAN Notices, vol.42, pp.405-414, 2007.

R. Gabriel and T. Chau-wen, Tiling optimizations for 3d scientific computations, Proceedings of the 2000 ACM/IEEE conference on Supercomputing, p.32, 2000.

S. Sriram and C. Siddhartha, Cache-efficient multigrid algorithms, The International Journal of High Performance Computing Applications, vol.18, pp.115-133, 2004.

S. Mariem, G. Jens, and M. Gilles, Automatic Code Generation for Iterative Multi-dimensional Stencil Computations, in A. Benoît, editor, High Performance Computing, Data, and Analytics, 2016.

S. Daniel, G. Jens, R. Daniel, and P. Isabelle, Resource-Centered Distributed Processing of Large Histopathology Images, 19th IEEE International Conference on Computational Science and Engineering, 2016.

J. E. Stone, D. Gohara, and G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering, vol.12, issue.3, pp.66-73, 2010.

Y. Song and Z. Li, New tiling techniques to improve cache temporal locality, vol.34, 1999.

A. Tiwari, C. Chen, J. Chame et al., A scalable auto-tuning framework for compiler optimization, pp.1-12.

Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson, The pochoir stencil compiler, Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, pp.117-128, 2011.

Y. Tang, R. Chowdhury, C.-K. Luk, and C. E. Leiserson, Coding stencil computations using the pochoir stencil-specification language, Poster session presented at the 3rd USENIX Workshop on Hot Topics in Parallelism, 2011.

A. Tiwari, M. A. Laurenzano, L. Carrington, and A. Snavely, Auto-tuning for energy usage in scientific applications, European Conference on Parallel Processing, pp.178-187, 2011.

D. Tschumperlé, The CImg library: http://cimg.sourceforge.net. The C++ Template Image Processing Library, 2004.

A. Taflove and K. R. Umashankar, The finite-difference time-domain method for numerical modeling of electromagnetic wave interactions, Electromagnetics, vol.10, issue.1-2, pp.105-126, 1990.

D. Unat, X. Cai, and S. B. Baden, Mint: realizing CUDA performance in 3D stencil methods with annotated C, Proceedings of the international conference on Supercomputing, pp.214-224, 2011.

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor, Polyhedral parallel code generation for CUDA, ACM Transactions on Architecture and Code Optimization (TACO), vol.9, issue.4, p.54, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00786677

R. Vuduc, J. W. Demmel, and K. A. Yelick, OSKI: A library of automatically tuned sparse matrix kernels, Journal of Physics: Conference Series, vol.16, p.521, 2005.

V. Sven, W. Samuel, J. C. Leonid, O. John, S. Katherine et al., Lattice boltzmann simulation optimization on leading multicore platforms, International Congress on Mathematical Software, pp.1-14, 2008.

R. C. Whaley and J. J. Dongarra, Automatically tuned linear algebra software, Supercomputing, 1998. SC98. IEEE/ACM Conference on, pp.38-38, 1998.

G. Wellein, G. Hager, T. Zeiser, M. Wittmann, and H. Fehske, Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization, Computer Software and Applications Conference, vol.1, pp.579-586, 2009.

W. Niklaus, Program development by stepwise refinement, Communications of the ACM, vol.14, issue.4, pp.221-227, 1971.

G. Brian, J. Williams, G. Eleanor, H. Catherine, M. Wayne et al., The potential impact of male circumcision on hiv in sub-saharan africa, PLoS medicine, vol.3, issue.7, p.262, 2006.

W. Michael, More iteration space tiling, Proceedings of the 1989 ACM/IEEE conference on Supercomputing, pp.655-664, 1989.

W. David, Using time skewing to eliminate idle time due to memory bandwidth and network limitations, Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International, pp.171-180, 2000.

Q. Wu, C. Yang, T. Tang, and L. Xiao, Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system, Journal of Parallel and Distributed Computing, vol.73, issue.12, pp.1592-1604, 2013.

X. Jingling, Loop tiling for parallelism, vol.575, 2012.

Z. Xing, G. Jean-pierre, M. Jesús, G. Robert, H. Kuhn et al., Hierarchical overlapped tiling, Proceedings of the Tenth International Symposium on Code Generation and Optimization, pp.207-218, 2012.

Z. Yongpeng, M. Frank, Z. Thomas, W. Gerhard, N. Aditya et al., Introducing a parallel cache oblivious blocking approach for the lattice boltzmann method, Proceedings of the Tenth International Symposium on Code Generation and Optimization, vol.8, pp.179-188, 2008.