, void *local, point2d_t *global_point, point2d_t *local_point, size_t elem_size, int width, int height, dma_event_t *event);
*local, void *global, point2d_t *global_point, point2d_t *local_point, size_t elem_size, int width, int height, dma_event_t *event)

M. Q. Ho, B. Tourancheau, C. Obrecht, B. Dupont-de-dinechin, and J. Reybert, MPI communication on MPPA many-core NoC: design, modeling and performance issues, Parallel Computing: On the Road to Exascale, Proceedings of the International Conference on Parallel Computing, vol.27, pp.113-122, 2015.

J. Hascoët, B. Dupont-de-dinechin, P. Guironnet-de-massas, and M. Q. Ho, Asynchronous one-sided communications and synchronizations for a clustered manycore processor, Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia, pp.51-60, 2017.

M. Q. Ho, C. Obrecht, B. Tourancheau, B. Dupont-de-dinechin, and J. Hascoët, Improving 3D lattice Boltzmann method stencil with asynchronous transfers on manycore processors, 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC 2017), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01652614

M. Q. Ho, C. Obrecht, and B. Tourancheau, New parallel in-place update algorithm for better memory usage in 3D lattice Boltzmann algorithm, 2017.

M. Q. Ho, B. Dupont-de-dinechin, B. Tourancheau, and C. Obrecht, BLIS-RDMA: A portable and high performance level-3 BLAS for DMA-based many-core architectures, 2017.

S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM, vol.52, issue.4, pp.65-76, 2009.

OpenACC: More Science Less Programming, 2018.

A. Munshi (editor), The OpenCL Specification, Version 1, 2012.

W. D. Gropp, E. L. Lusk, and A. Skjellum, Using MPI: portable parallel programming with the message-passing interface, vol.1, 1999.

J. Ragan-kelley, C. Barnes, A. Adams, S. Paris, F. Durand et al., Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, ACM SIGPLAN Notices, vol.48, issue.6, pp.519-530, 2013.

R. Keryell, R. Reyes, and L. Howes, Khronos SYCL for OpenCL: a tutorial, Proceedings of the 3rd International Workshop on OpenCL, p.24, 2015.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, issue.2, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, ACM SIGPLAN Notices, vol.48, pp.167-176, 2013.

P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, Algorithm-based fault tolerance for dense matrix factorizations, ACM SIGPLAN Notices, vol.47, issue.8, pp.225-234, 2012.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.

H. Wunderlich, C. Braun, and S. Halder, Efficacy and efficiency of algorithm-based fault-tolerance on GPUs, On-Line Testing Symposium (IOLTS), 2013 IEEE 19th International, pp.240-243, 2013.

P. Rech, C. Aguiar, C. Frost, and L. Carro, An efficient and experimentally tuned software-based hardening strategy for matrix multiplication on GPUs, IEEE Transactions on Nuclear Science, vol.60, issue.4, pp.2797-2804, 2013.

U. Frisch, B. Hasslacher, and Y. Pomeau, Lattice-gas automata for the Navier-Stokes equation, Physical review letters, vol.56, issue.14, p.1505, 1986.

G. R. McNamara and G. Zanetti, Use of the Boltzmann equation to simulate lattice-gas automata, Physical review letters, vol.61, issue.20, p.2332, 1988.

N. Cao, S. Chen, J. Shi, and D. Martinez, Physical symmetry and lattice symmetry in the lattice Boltzmann method, Physical Review E, vol.55, issue.1, p.21, 1997.

F. Massaioli and G. Amati, Achieving high performance in a LBM code using OpenMP, The Fourth European Workshop on OpenMP, 2002.

S. McIntosh-Smith, M. Boulton, D. Curran, and J. Price, On the performance portability of structured grid codes on many-core computer architectures, Supercomputing, pp.53-75, 2014.

C. Obrecht, B. Tourancheau, and F. Kuznik, Performance Evaluation of an OpenCL Implementation of the Lattice Boltzmann Method on the Intel Xeon phi, Parallel Processing Letters, vol.25, issue.03, p.1541001, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01286306

T. Pohl, M. Kowarschik, J. Wilke, K. Iglberger, and U. Rüde, Optimization and profiling of the cache performance of parallel lattice Boltzmann codes, Parallel Processing Letters, vol.13, issue.04, pp.549-560, 2003.

K. Mattila, J. Hyväluoma, T. Rossi, M. Aspnäs, and J. Westerholm, An efficient swap algorithm for the lattice Boltzmann method, Computer Physics Communications, vol.176, issue.3, pp.200-210, 2007.

K. Mattila, J. Hyväluoma, J. Timonen, and T. Rossi, Comparison of implementations of the lattice-Boltzmann method, Computers & Mathematics with Applications, vol.55, issue.7, pp.1514-1524, 2008.

M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, Comparison of different propagation steps for lattice Boltzmann methods, Computers & Mathematics with Applications, vol.65, issue.6, pp.924-935, 2013.

P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, Accelerating lattice Boltzmann fluid flow simulations using graphics processors, Parallel Processing, 2009. ICPP'09. International Conference on, pp.550-557, 2009.

M. Geier and M. Schönherr, Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware, Computation, vol.5, issue.2, p.19, 2017.

C. Obrecht, F. Kuznik, B. Tourancheau, and J. Roux, Efficient GPU implementation of the linearly interpolated bounce-back boundary condition, Computers & Mathematics with Applications, vol.65, issue.6, pp.936-944, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00731150

B. Dupont-de-dinechin, R. Ayrignac, P. Beaucamps, P. Couvert, B. Ganne et al., A clustered manycore processor architecture for embedded and accelerated applications, High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pp.1-6, 2013.

H. Fu, J. Liao, J. Yang, L. Wang, Z. Song et al., The Sunway TaihuLight supercomputer: system and applications, Science China Information Sciences, vol.59, issue.7, p.72001, 2016.

C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for Fortran usage, ACM Transactions on Mathematical Software (TOMS), vol.5, issue.3, pp.308-323, 1979.

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff, A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software (TOMS), vol.16, issue.1, pp.1-17, 1990.

E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney et al., LAPACK: A portable linear algebra library for high-performance computers, Proceedings of the 1990 ACM/IEEE conference on Supercomputing, pp.2-11, 1990.

J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker, ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers, Frontiers of Massively Parallel Computation, 1992., Fourth Symposium on the, pp.120-127, 1992.

J. D. McCalpin, A survey of memory bandwidth and machine balance in current high performance computers, IEEE TCCA Newsletter, pp.19-25, 1995.

A. Petitet and J. Dongarra, HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, 2008.

J. J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK benchmark: past, present and future, Concurrency and Computation: Practice and Experience, vol.15, pp.803-820, 2003.

Intel, Intel Math Kernel Library, 2017.

Advanced Micro Devices (AMD), 2017.

K. Goto and R. Van-de-geijn, High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software (TOMS), vol.35, issue.1, p.4, 2008.

K. Goto and R. A. Van-de-geijn, Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software (TOMS), vol.34, issue.3, p.12, 2008.

Z. Xianyi, W. Qian, and W. Saar, OpenBLAS, 2017.

C. Whaley, A. Petitet, and J. J. Dongarra, Automated empirical optimizations of software and the ATLAS project, Parallel Computing, vol.27, issue.1, pp.3-35, 2001.

A. YarKhan, J. Kurzak, P. Luszczek, and J. Dongarra, Porting the PLASMA numerical library to the OpenMP standard, International Journal of Parallel Programming, vol.45, issue.3, pp.612-633, 2017.

F. G. Van-zee and R. A. Van-de-geijn, BLIS: A framework for rapidly instantiating BLAS functionality, ACM Transactions on Mathematical Software (TOMS), vol.41, issue.3, p.14, 2015.

GPU-accelerated standard BLAS library, 2017.

A software library containing BLAS functions written in OpenCL, 2017.

Next generation BLAS implementation for ROCm platform, 2017.

S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010.

C. Nugteren, CLBlast: A Tuned OpenCL BLAS Library. CoRR, 2017.

A. Abdelfattah, D. E. Keyes, and H. Ltaief, KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators, 2014.

F. G. Van-zee, E. Chan, R. A. Van-de-geijn, E. S. Quintana-orti, and G. Quintana-orti, Introducing: The libflame library for dense matrix computations, Computing in science & engineering, 2009.

T. M. Low, F. D. Igual, T. M. Smith, and E. S. Quintana-orti, Analytical modeling is enough for high-performance BLIS, ACM Transactions on Mathematical Software (TOMS), vol.43, issue.2, p.12, 2016.

F. G. Van-zee, T. M. Smith, B. Marker, T. M. Low et al., The BLIS framework: Experiments in portability, ACM Transactions on Mathematical Software (TOMS), vol.42, issue.2, p.12, 2016.

M. Castro, F. Dupros, E. Francesquini, J. Méhaut, and P. Navaux, Energy efficient seismic wave propagation simulation on a low-power manycore processor, Computer Architecture and High Performance Computing (SBAC-PAD), pp.57-64, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01060286

S. Raase and T. Nordström, On the use of a many-core processor for computational fluid dynamics simulations, Procedia Computer Science, vol.51, pp.1403-1412, 2015.

P. Nagar, F. Song, L. Zhu, and L. Lin, LBM-IB: A parallel library to solve 3D fluid-structure interaction problems on manycore systems, Parallel Processing (ICPP), 2015 44th International Conference on, pp.51-60, 2015.

H. Sagan, Space-filling curves, 2012.

D. S. Wise, Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free, European Conference on Parallel Processing, pp.774-783, 2000.

J. Thiyagalingam, O. Beckmann, and P. Kelly, Is Morton layout competitive for large two-dimensional arrays yet?, Concurrency and Computation: Practice and Experience, vol.18, pp.1509-1539, 2006.

D. d'Humières, Multiple-relaxation-time lattice Boltzmann models in three dimensions, Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol.360, issue.1792, pp.437-451, 2002.

A. Sodani, Knights Landing (KNL): 2nd Generation Intel® Xeon Phi processor, Hot Chips 27 Symposium (HCS), 2015 IEEE, pp.1-24, 2015.

V. Codreanu and J. Rodríguez, Best Practice Guide - Knights Landing, 2017.

R. W. Green, Memory movement and initialization: Optimization and control, 2014.

B. Dupont-de-dinechin, P. Guironnet-de-massas, G. Lager, C. Léger, B. Orgogozo et al., A Distributed Run-Time Environment for the Kalray MPPA®-256 Integrated Manycore Processor, Procedia Computer Science, vol.18, pp.1654-1663, 2013.

Tile processor architecture overview for the Tilepro series, 2013.

J. Mottin, M. Cartron, and G. Urlini, The STHORM Platform, Smart Multicore Embedded Systems, pp.35-43, 2014.
URL : https://hal.archives-ouvertes.fr/cea-01818395

N. Hemsoth, The Tiny Chip That Could Disrupt Exascale Computing, 2015.

Z. Xianyi, Z. Qian, and Z. Chothia, OpenBLAS, pp.9-13, 2013.

J. Jeffers and J. Reinders, Intel Xeon Phi Coprocessor High Performance Programming, 2013.

L. Prylli and B. Tourancheau, BIP: a new protocol designed for high performance networking on myrinet, Parallel and Distributed Processing, pp.472-485, 1998.

L. Prylli, B. Tourancheau, and R. Westrelin, Modeling of a high speed network to maximize throughput performance: the experience of BIP over Myrinet, Parallel and Distributed Processing Techniques and Applications-PDPTA, vol.2, pp.341-349, 1998.

M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat et al., The Raw microprocessor: A computational fabric for software circuits and general-purpose programs, IEEE Micro, vol.22, issue.2, pp.25-35, 2002.

J. R. Psota, rMPI: An MPI-compliant message passing library for tiled architectures, 2005.

J. Psota and A. Agarwal, rMPI: message passing on multicore processors with on-chip interconnect, High Performance Embedded Architectures and Compilers, pp.22-37, 2008.

M. Kang, E. Park, M. Cho, J. Suh, D. Kang et al., MPI performance analysis and optimization on tile64/maestro, Proceedings of Workshop on Multicore Processors for Space-Opportunities and Challenges Held in conjunction with SMC-IT, pp.19-23, 2009.

B. d'Ausbourg, M. Boyer, E. Noulard, and C. Pagetti, Deterministic Execution on Many-Core Platforms: application to the SCC, 4th Many-core Applications Research Community (MARC) Symposium, p.43, 2012.

T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas et al., The 48-core SCC processor: the programmer's view, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.

C. Clauss, S. Lankes, P. Reble, and T. Bemmerl, Evaluation and improvements of programming models for the Intel SCC many-core processor, High Performance Computing and Simulation (HPCS), 2011 International Conference on, pp.525-532, 2011.

S. Potluri, K. Hamidouche, D. Bureddy, and D. Panda, MVAPICH2-MIC: A High Performance MPI Library for Xeon Phi Clusters with InfiniBand

T. Von-eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active messages: a mechanism for integrated communication and computation, vol.20, 1992.

Kalray Inc., MPPA-256 Cluster and I/O Subsystem Architecture, 2015.

Kalray Inc., MPPAIPC Performance, 2013.

S. Kumar, A. Jantsch, J. Soininen, M. Forsell, M. Millberg et al., A network on chip architecture and design methodology, VLSI, 2002. Proceedings. IEEE Computer Society Annual Symposium on, pp.105-112, 2002.

K. Matsumoto, N. Nakasato, and S. G. Sedukhin, Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs, High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp.396-405, 2012.

N. Rajovic, P. M. Carpenter, I. Gelado, N. Puzovic, A. Ramirez et al., Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC?, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.40, 2013.

M. Feldman, Project Sets Sights on Exascale Processor, 2017.

S. Pakin, V. Karamcheti, and A. Chien, Fast Messages: Efficient, portable communication for workstation clusters and MPPs, IEEE concurrency, vol.5, issue.2, pp.60-72, 1997.

Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. Gaydadjiev, 64-bit floating-point FPGA matrix multiplication, Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, pp.86-95, 2005.

F. D. Igual, M. Ali, A. Friedmann, E. Stotzer, T. Wentz et al., Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.26, 2012.

J. Kurzak, A. Buttari, and J. Dongarra, Solving systems of linear equations on the CELL processor using Cholesky factorization, IEEE Transactions on Parallel and Distributed Systems, vol.19, issue.9, pp.1175-1186, 2008.
URL : https://hal.archives-ouvertes.fr/hal-02421046

V. Saxena, P. Agrawal, Y. Sabharwal, V. K. Garg et al., Optimization of BLAS on the Cell Processor, HiPC, vol.5374, pp.18-29, 2008.

J. Lin, Z. Xu, A. Nukada, N. Maruyama, and S. Matsuoka, Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor, Parallel Processing (ICPP), 2017 46th International Conference on, pp.432-441, 2017.

M. Tasende, Generation of the Single Precision BLAS library for the Parallella platform, with Epiphany co-processor acceleration, using the BLIS framework, Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), 2016 IEEE 14th Intl C, pp.894-897, 2016.

M. Ali, E. Stotzer, F. D. Igual, and R. A. Van-de-geijn, Level-3 BLAS on the TI C6678 multi-core DSP, Computer Architecture and High Performance Computing (SBAC-PAD), pp.179-186, 2012.

D. Parikh, F. D. Igual, and M. Ali, Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS, 2017.

D. Parikh and W. Leven, An Implementation of GEMM for DMA-enabled Architectures, 2017.

Texas Instruments (TI), MCSDK HPC 3.x Linear Algebra Library, 2017.

T. Szydzik, M. Farcas, V. Ohan, and D. Moloney, Level-3 BLAS on Myriad multi-core media-processor SoC, Hot Chips 26 Symposium (HCS), 2014 IEEE, pp.1-1, 2014.

F. G. Van-zee and T. M. Smith, Implementing high-performance complex matrix multiplication via the 3m and 4m methods, ACM Transactions on Mathematical Software. Under review, 2017.

T. M. Smith, R. Van-de-geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van-zee, Anatomy of high-performance many-threaded matrix multiplication, Parallel and Distributed Processing Symposium, pp.1049-1059, 2014.