, void *local, point2d_t *global_point, point2d_t *local_point, size_t elem_size, int width, int height, dma_event_t *event);
*local, void *global, point2d_t *global_point, point2d_t *local_point, size_t elem_size, int width, int height, dma_event_t *event)
Benoît Dupont de Dinechin, and Jérôme Reybert. MPI communication on MPPA many-core NoC: design, modeling and performance issues, Parallel Computing: On the Road to Exascale, Proceedings of the International Conference on Parallel Computing, vol.27, pp.113-122, 2015.
Asynchronous one-sided communications and synchronizations for a clustered manycore processor, Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia, pp.51-60, 2017. ,
Improving 3D lattice Boltzmann method stencil with asynchronous transfers on manycore processors, 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC 2017), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01652614
New parallel in-place update algorithm for better memory usage in 3D lattice Boltzmann algorithm, 2017. ,
BLIS-RDMA: A portable and high performance level-3 BLAS for DMA-based many-core architectures, 2017. ,
Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM, vol.52, issue.4, pp.65-76, 2009. ,
, OpenACC: More Science Less Programming, 2018.
, Editor : Aaftab Munshi. The OpenCL Specification. Version 1, 2012.
Using MPI: portable parallel programming with the message-passing interface, vol.1, 1999. ,
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, ACM SIGPLAN Notices, vol.48, issue.6, pp.519-530, 2013. ,
Khronos SYCL for OpenCL: a tutorial, Proceedings of the 3rd International Workshop on OpenCL, p.24, 2015. ,
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, issue.2, pp.187-198, 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00384363
Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, In ACM SIGPLAN Notices, vol.48, pp.167-176, 2013. ,
Algorithm-based fault tolerance for dense matrix factorizations, ACM SIGPLAN Notices, vol.47, issue.8, pp.225-234, 2012.
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
Efficacy and efficiency of algorithm-based fault-tolerance on GPUs, On-Line Testing Symposium (IOLTS), 2013 IEEE 19th International, pp.240-243, 2013. ,
An efficient and experimentally tuned software-based hardening strategy for matrix multiplication on GPUs, IEEE Transactions on Nuclear Science, vol.60, issue.4, pp.2797-2804, 2013. ,
Lattice-gas automata for the Navier-Stokes equation, Physical review letters, vol.56, issue.14, p.1505, 1986. ,
Use of the Boltzmann equation to simulate lattice-gas automata, Physical review letters, vol.61, issue.20, p.2332, 1988. ,
Physical symmetry and lattice symmetry in the lattice Boltzmann method, Physical Review E, vol.55, issue.1, p.21, 1997. ,
Achieving high performance in a LBM code using OpenMP, The Fourth European Workshop on OpenMP, 2002. ,
On the performance portability of structured grid codes on many-core computer architectures, Supercomputing, pp.53-75, 2014. ,
Performance Evaluation of an OpenCL Implementation of the Lattice Boltzmann Method on the Intel Xeon phi, Parallel Processing Letters, vol.25, issue.03, p.1541001, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01286306
Optimization and profiling of the cache performance of parallel lattice Boltzmann codes, Parallel Processing Letters, vol.13, issue.04, pp.549-560, 2003. ,
An efficient swap algorithm for the lattice Boltzmann method, Computer Physics Communications, vol.176, issue.3, pp.200-210, 2007. ,
Comparison of implementations of the lattice-Boltzmann method, Computers & Mathematics with Applications, vol.55, issue.7, pp.1514-1524, 2008. ,
Comparison of different propagation steps for lattice Boltzmann methods, Computers & Mathematics with Applications, vol.65, issue.6, pp.924-935, 2013. ,
Accelerating lattice Boltzmann fluid flow simulations using graphics processors, Parallel Processing, 2009. ICPP'09. International Conference on, pp.550-557, 2009. ,
Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware, Computation, vol.5, issue.2, p.19, 2017. ,
Efficient GPU implementation of the linearly interpolated bounce-back boundary condition, Computers & Mathematics with Applications, vol.65, issue.6, pp.936-944, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00731150
A clustered manycore processor architecture for embedded and accelerated applications, High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pp.1-6, 2013. ,
The Sunway TaihuLight supercomputer: system and applications, Science China Information Sciences, vol.59, issue.7, p.72001, 2016. ,
Basic linear algebra subprograms for Fortran usage, ACM Transactions on Mathematical Software (TOMS), vol.5, issue.3, pp.308-323, 1979. ,
A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software (TOMS), vol.16, issue.1, pp.1-17, 1990. ,
LAPACK: A portable linear algebra library for high-performance computers, Proceedings of the 1990 ACM/IEEE conference on Supercomputing, pp.2-11, 1990.
ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers, Frontiers of Massively Parallel Computation, 1992., Fourth Symposium on the, pp.120-127, 1992. ,
A survey of memory bandwidth and machine balance in current high performance computers, IEEE TCCA Newsletter, pp.19-25, 1995. ,
HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, 2008.
The LINPACK benchmark: past, present and future, Concurrency and Computation: Practice and Experience, vol.15, pp.803-820, 2003.
, Intel. Intel Math Kernel Library, 2017.
, Advanced Micro Devices (AMD), 2017.
High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software (TOMS), vol.35, issue.1, p.4, 2008. ,
Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software (TOMS), vol.34, issue.3, p.12, 2008. ,
, OpenBLAS, 2017.
Automated empirical optimizations of software and the ATLAS project, Parallel Computing, vol.27, issue.1, pp.3-35, 2001. ,
Porting the plasma numerical library to the openmp standard, International Journal of Parallel Programming, vol.45, issue.3, pp.612-633, 2017. ,
BLIS: A framework for rapidly instantiating BLAS functionality, ACM Transactions on Mathematical Software (TOMS), vol.41, issue.3, p.14, 2015. ,
, GPU-accelerated standard BLAS library, 2017.
, A software library containing BLAS functions written in OpenCL, 2017.
, Next generation BLAS implementation for ROCm platform, 2017.
Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010. ,
CLBlast: A Tuned OpenCL BLAS Library. CoRR, 2017. ,
KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators, 2014. ,
Introducing: The libflame library for dense matrix computations, Computing in science & engineering, 2009. ,
Analytical modeling is enough for high-performance BLIS, ACM Transactions on Mathematical Software (TOMS), vol.43, issue.2, p.12, 2016. ,
The BLIS framework: Experiments in portability, ACM Transactions on Mathematical Software (TOMS), vol.42, issue.2, p.12, 2016. ,
Energy efficient seismic wave propagation simulation on a low-power manycore processor, Computer Architecture and High Performance Computing (SBAC-PAD), pp.57-64, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01060286
On the use of a many-core processor for computational fluid dynamics simulations, Procedia Computer Science, vol.51, pp.1403-1412, 2015. ,
LBM-IB: A parallel library to solve 3D fluid-structure interaction problems on manycore systems, Parallel Processing (ICPP), 2015 44th International Conference on, pp.51-60, 2015.
Space-filling curves, 2012. ,
Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free, European Conference on Parallel Processing, pp.774-783, 2000. ,
Is Morton layout competitive for large two-dimensional arrays yet? Concurrency and Computation: Practice and Experience, vol.18, pp.1509-1539, 2006. ,
Multiple-relaxation-time lattice Boltzmann models in three dimensions, Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol.360, issue.1792, pp.437-451, 2002.
Knights Landing (KNL): 2nd Generation Intel® Xeon Phi processor, Hot Chips 27 Symposium (HCS), 2015 IEEE, pp.1-24, 2015.
Best Practice Guide - Knights Landing, 2017.
Memory movement and initialization: Optimization and control, 2014. ,
A Distributed Run-Time Environment for the Kalray MPPA®-256 Integrated Manycore Processor, Procedia Computer Science, vol.18, pp.1654-1663, 2013.
, Tile processor architecture overview for the Tilepro series, 2013.
The STHORM Platform, Smart Multicore Embedded Systems, pp.35-43, 2014. ,
URL : https://hal.archives-ouvertes.fr/cea-01818395
The Tiny Chip That Could Disrupt Exascale Computing, 2015. ,
Intel Xeon Phi Coprocessor High Performance Programming, 2013. ,
BIP: a new protocol designed for high performance networking on myrinet, Parallel and Distributed Processing, pp.472-485, 1998. ,
Modeling of a high speed network to maximize throughput performance: the experience of BIP over Myrinet, Parallel and Distributed Processing Techniques and Applications-PDPTA, vol.2, pp.341-349, 1998. ,
The Raw microprocessor: A computational fabric for software circuits and general-purpose programs, IEEE Micro, vol.22, issue.2, pp.25-35, 2002.
rMPI: An MPI-compliant message passing library for tiled architectures, 2005. ,
rMPI: message passing on multicore processors with on-chip interconnect, High Performance Embedded Architectures and Compilers, pp.22-37, 2008. ,
MPI performance analysis and optimization on tile64/maestro, Proceedings of Workshop on Multicore Processors for Space-Opportunities and Challenges Held in conjunction with SMC-IT, pp.19-23, 2009. ,
Deterministic Execution on Many-Core Platforms: application to the SCC, 4th Many-core Applications Research Community (MARC) Symposium, p.43, 2012. ,
The 48-core SCC processor: the programmer's view, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010. ,
Evaluation and improvements of programming models for the Intel SCC many-core processor, High Performance Computing and Simulation (HPCS), 2011 International Conference on, pp.525-532, 2011. ,
MVAPICH2-MIC: A High Performance MPI Library for Xeon Phi Clusters with InfiniBand ,
Active messages: a mechanism for integrated communication and computation, vol.20, 1992. ,
MPPA-256 Cluster and I/O Subsystem Architecture, 2015. ,
A network on chip architecture and design methodology, VLSI, 2002. Proceedings. IEEE Computer Society Annual Symposium on, pp.105-112, 2002. ,
Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs, High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp.396-405, 2012. ,
Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC?, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.40, 2013. ,
, Project Sets Sights on Exascale Processor, 2017.
Fast Messages: Efficient, portable communication for workstation clusters and MPPs, IEEE concurrency, vol.5, issue.2, pp.60-72, 1997. ,
64-bit floating-point FPGA matrix multiplication, Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, pp.86-95, 2005. ,
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.26, 2012.
Solving systems of linear equations on the CELL processor using Cholesky factorization, IEEE Transactions on Parallel and Distributed Systems, vol.19, issue.9, pp.1175-1186, 2008. ,
URL : https://hal.archives-ouvertes.fr/hal-02421046
Optimization of BLAS on the Cell Processor, HiPC, vol.5374, pp.18-29, 2008. ,
Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor, Parallel Processing (ICPP), 2017 46th International Conference on, pp.432-441, 2017. ,
Generation of the Single Precision BLAS library for the Parallella platform, with Epiphany co-processor acceleration, using the BLIS framework, 2016 IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress, pp.894-897, 2016.
Level-3 BLAS on the TI C6678 multi-core DSP, Computer Architecture and High Performance Computing (SBAC-PAD), pp.179-186, 2012. ,
Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS, 2017. ,
An Implementation of GEMM for DMA-enabled Architectures, 2017. ,
, Texas Instruments (TI). MCSDK HPC 3.x Linear Algebra Library, 2017.
Level-3 BLAS on Myriad multi-core media-processor SoC, Hot Chips 26 Symposium (HCS), 2014 IEEE, pp.1-1, 2014.
Implementing high-performance complex matrix multiplication via the 3m and 4m methods, ACM Transactions on Mathematical Software. Under review, 2017. ,
Anatomy of high-performance many-threaded matrix multiplication, Parallel and Distributed Processing Symposium, pp.1049-1059, 2014. ,