Nvidia Drive PX 2: A Complex and Highly Parallel Heterogeneous Embedded System for Autonomous Driving

MPPA® Processor

Typical Central Processing Unit (CPU) Cores Linked to a Memory with Memory Access Examples

2.6 (IO) Memory Management Unit (MMU) Role in a Heterogeneous Computer System

Symmetric Multi-Processing

Example of Pthread Multi-threading Programming

OpenCL Mapping of Applications and Memory Model. Source: Kalray's

Dataflow Process Network (DPN) Programming Model Example and Semantic

Single-Rate Transformation: Synchronous Dataflow (SDF) (left) to Single-Rate Directed Acyclic Graph (DAG) (SRDAG) (right)

C. Dataflow,

Flattening and the Single-Rate Transformation: Hierarchical SDF (left) to Single-Rate DAG (SRDAG) (right)

Flattening and Single-Rate Transformation of Interface-Based SDF (IBSDF) (left) to Single-Rate DAG (SRDAG) (right)

3.11 Parameterized and Interfaced SDF (PiSDF) Programming Model Semantics and Example

Typical PREESM's Rapid Prototyping Workflow

The sending transfer initiated by the left CPU must strictly match the received command initiated by the right CPU

Swap in Practice on the k1 VLIW Core

Memory Segment Usages with the Create and Clone Functions

5.2 Remote Direct Memory Access (RDMA) Put and Get Operation on Window Memory Segments

Enqueue and Dequeue Operation using Remote Queue Memory Segments

RDMA Put/Get Data Transfer Restructuring Pattern

Architecture of Asynchronous One-Sided (AOS) in a Compute Cluster

RDMA Get (Read) Throughput GB/s (Asynchronous)

RDMA Put (Write) Throughput GB/s (Asynchronous)

RDMA Get (Read) Latency µs (Blocking)

RDMA Put (Write) Latency µs (Blocking)

Active Message Latency

Specific States & Transitions of Threads in the New Multi-Threading Runtime

Build and Test Process for the Integration of the New Multi-threading Runtime in the Software Toolchain

RDMA Transfers for Automatic Double Buffering (parallel code)

Join and Basic Synchronization Primitives on 16 Cores

Join and Basic Synchronization Primitives on 64 Threads

G. Openmp and . Libgomp, Based on our New Multi-threading Runtime with 16 Threads Running

;. Ibsdf-graph and . .. Denoising, , p.117

New PREESM Workflow for Clustering and Parallel Loop Generation

Generated Code Example inside the Compute Clusters (CCs) of the Manycore Processor

MPPA® Matrix Result in Frames per second (fps)

MPPA® Matrix Results: Ratios between Network on Chip (NoC) Communications and Processing Time (lower is better, lower means more Processing Elements (PEs) efficiency). Communication Overheads Relative to Total Execution Time

Architecture of the Reconfigurable Dataflow Runtime onto a DMA-Enabled Clustered Manycore Processor

The number of requests is the number of input First-In-First-Out queues (FIFOs) of the next actor

Algorithm for the Local Memory Allocation in the CC

Application Performance on a 4K Video

Example of an OpenVX Application

OpenVX Offloading Engine Architecture

OpenVX Verify Graph Workflow - vxVerifyGraph [G+17]

Example of a Graph Display from the Input/Output Subsystem (IO), Schedule and Fusion Optimizations

Automated Multi-clusters Tiling Combined with Fusion

Example of Geometrical Transformation, namely a Rotation

Automatic Tiling Engine Performance. VGA Images, Simple Tiling vs Tiling with N-Buffering (N_BUF = N-Buffering = Prefetch)

Automatic Tiling Engine Performance. Full HD Images, Simple Tiling vs Tiling with N-Buffering (N_BUF = N-Buffering = Prefetch)

Automatic RDMA-based Kernel Fusion Performance. VGA Images. Tiling with N-Buffering (N_BUF = N-Buffering = Prefetch) vs Kernel Fusing

Automatic RDMA-based Kernel Fusion Performance. Full HD Images. Tiling with N-Buffering (N_BUF = N-Buffering = Prefetch) vs Kernel Fusing

RDMA-based 2D Explicit Cache of Tiles Performance. VGA Images

RDMA-based 2D Explicit Cache of Tiles Performance. Full HD Images

Mono-Cluster RDMA-based 2D Explicit Cache of Tiles Performance

Mono-Cluster RDMA-based 2D Explicit Cache of Tiles Performance. Full HD Images

Lattice Boltzmann Method (LBM) D3Q19 Stencil

10.2 3D LBM/stencil decomposition where a Main-node subdomain (green) is copied with its surrounding halo layers (if they exist) and one extra subdomain (blue) is needed to store the post-collision state

B: beginning of the copied 3D tile (S), represented by: B_a: index of S on local memory (from A) and B_r: index of S on main memory (from R)

OPAL_async vs. OPAL OpenCL on MPPA® for duration = 1000 steps

Performance extrapolation of OPAL_async with 8 × 8 × 8 subdomains with the first eight CCs correlation represented by a gray line for 1000 timesteps and cavity size 128

Architecture of the Distributed FFT for Low-Latency Execution over Several Compute Clusters (CCs)

Example of a Vectorization (pair of registers) in the k1 VLIW PE

10.8 Execution Time of the Mono-Cluster Fast Fourier Transform (FFT). The Higher, The Better

Execution Time of Distributed Multi-Cluster FFT. The Higher, The Better

10.10 Architecture of the Kalray Neural Network (KANN) Framework

10.11 Broadcast Operation From the Main Double Data Rate (DDR)3 Memory to the CCs

A. Le and .. .. Mppa-®-de-kalray,

Use of memory segments and protocols

Verification and optimization of the application OpenVX graph - vxVerifyGraph

Automated fusion of standard OpenVX nodes at run time

NoC Resources used by the AOS library for each of the Compute Cluster (CC) and each of the Input/Output Subsystem (IO) Composing an Entire MPPA® Processor

5.3 Performance of the Remote Queues in Kilo Input/Output Operation per Second (IOPS)

Scheduler Condition Call on Standard Primitives for Cooperative

Auto-threading Throughput on Three Different Use-cases

fps and Speedups for Texas Instruments (TI) Digital Signal Processor (DSP) and Intel Processor

S. Fps and . .. Cluster,

Multi-cluster Performance of the Harris Corner Detection of OpenVX on MPPA® in fps

10.1 3-depth pipeline (triple-buffering) which allows a 2-step distance between GET and WAIT, but only a 1-step distance between PUT and WAIT, thus the PUT transfer will not be well overlapped (G: Get, P: Put, W: Wait, C: Compute, WCP = {Wait + Compute + Put}, WG = {Wait + Get})

Summary of the Memory Footprint of the Distributed FFT on Several CCs

Performance of the GoogleNet Convolutional Neural Network (CNN) batch-1 (latency = throughput) using Single-precision Floating-point Operation

AI Arithmetic Intensity

AMD Advanced Micro Devices

AOS Asynchronous One-Sided

API Application Programming Interface

ARM Advanced Reduced Instruction Set Computer (RISC) Machine

ARMCI Aggregate Remote Memory Copy Interface

BDF Boolean Dataflow

BF Best-Fit

CNN Convolutional Neural Network

CSDF Cyclo-Static Dataflow

CTA Compositional Temporal Analysis

CUDA Compute Unified Device Architecture

DAG Directed Acyclic Graph

DCB Data Center Bridging

DDR Double Data Rate

DFS Depth-First Search

DFT Discrete Fourier Transform

DMA Direct Memory Access

DPN Dataflow Process Network

DSL Domain Specific Language

DSM Distributed Shared Memory

DSP Digital Signal Processor

DSSF Deterministic SDF with Shared FIFOs

ELF Executable and Linkable Format

FF First-Fit

FFT Fast Fourier Transform

FLOPS Floating Point Operations per Second

FPGA Field-Programmable Gate Array

fps Frames per second

GCC GNU Compiler Collection

GPU Graphics Processing Unit

GRT Global Runtime

HAL Hardware Abstraction Layer

HBM High Bandwidth Memory

HBW Halo Bandwidth

HPC High-Performance Computing

IBSDF Interface-Based SDF

IBTA Infiniband Trade Association

IETR Institute of Electronics and Telecommunications of Rennes

ILP Instruction-Level Parallelism

IO Input/Output Subsystem

IOCTL Input/Output Control

IOPS Input/Output Operation per Second

IoT Internet of Things

IP Intellectual Property

IPC Inter-Process Communication

IR Intermediate Representation

ISA Instruction Set Architecture

KANN Kalray Neural Network

KPN Kahn Process Network

LBM Lattice Boltzmann Method

LLC Last Level Cache

LLVM Low-Level Virtual Machine

LRT Local Runtime

MLUPS Mega Lattice Updates per Second

MoC Model of Computation

MPI Message Passing Interface

MPPA Multi-Purpose Processor Array

MPSoC Multiprocessor System-on-Chip

NMTR New Multi-Threading Runtime

NORMA No Remote Memory Access

NUMA Non-Uniform Memory Access

OFA Open-Fabrics Association

OpenMP Open Multi-Processing

Orcc Open RVC-CAL Compiler

OS Operating System

PCIE Peripheral Component Interconnect Express

PDF Particle Distribution Function

PE Processing Element

PGAN Pairwise Grouping of Adjacent Nodes

PGAS Partitioned-Global-Address-Space

PIC Position Independent Code

PiMM Parameterized and Interfaced dataflow Meta-Model

PiSDF Parameterized and Interfaced SDF

PREESM Parallel and Real-time Embedded Executives Scheduling Method

PSDF Parameterized SDF

PSO Partial Store Order

QoS Quality-of-Service

QPI QuickPath Interconnect

RAM Random Access Memory

RAW Read After Write

RDDP Remote Direct Data Placement

RISC Reduced Instruction Set Computer

RM Resource Manager

RMO Relaxed Memory Order

RoCE RDMA over Converged Ethernet

RR Round-Robin

RTOS Real-Time Operating System

RV Repetition Vector

SADF Scenario-Aware Dataflow

SDF Synchronous Dataflow

SDFG Synchronous Dataflow Graph

SIMD Single Instruction, Multiple Data

SIMT Single Instruction, Multiple Threads

SISD Single Instruction, Single Data

SMEM Shared Memory

J. Hascoët, B. Dupont de Dinechin, K. Desnos, and J. Nezan, A Distributed Framework for Low-Latency OpenVX over the RDMA NoC of a Clustered Manycore, IEEE High Performance extreme Computing Conference (HPEC), 2018.

H. Miomandre, J. Hascoët, K. Desnos, K. Martin, B. Dupont de Dinechin et al., Embedded Runtime for Reconfigurable Dataflow Graphs on Manycore Architectures, Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms
URL : https://hal.archives-ouvertes.fr/hal-01704702

J. Hascoët, B. Dupont de Dinechin, P. Guironnet de Massas et al., Asynchronous one-sided communications and synchronizations for a clustered manycore processor, Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia, 2017.

J. Hascoët, K. Desnos, J. Nezan, and B. Dupont de Dinechin, Hierarchical Dataflow Model for efficient programming of clustered manycore processors. Application-specific Systems, 2017.

H. Miomandre, J. Hascoet, K. Desnos, K. Martin, B. Dupont de Dinechin et al., Demonstrating the SPIDER Runtime for Reconfigurable Dataflow Graphs Execution onto a DMA-based Manycore Processor, IEEE International Workshop on Signal Processing Systems, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01637300

M. Ho, C. Obrecht, B. Tourancheau, B. Dupont de Dinechin et al., Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors. 36th IEEE International Performance Computing and Communications Conference, 2017.
DOI : 10.1109/pccc.2017.8280472

URL : https://hal.archives-ouvertes.fr/hal-01652614

J. Hascoet, J. Nezan, A. Ensor, and B. Dupont de Dinechin, Implementation of a fast Fourier transform algorithm onto a manycore processor. Design and Architectures for Signal and Image Processing (DASIP), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01238833

Airbus, 2018.

J. Ajanovic, PCI Express 3.0 overview, Proceedings of Hot Chip: A Symposium on High Performance Chips, vol.69, p.143, 2009.

M. Adé, R. Lauwereins, and J. A. Peperstraete, Data memory minimisation for synchronous data flow graphs emulated on DSP-FPGA targets, Proceedings of the 34th annual Design Automation Conference, p.44, 1997.

F. Brill and E. Albuz, NVIDIA VisionWorks toolkit, GPU Technology Conference, p.144, 2014.

S. C. Brunet, C. Alberti, M. Mattavelli, and J. W. Janneck, Design space exploration of high level stream programs on parallel architectures: a focus on the buffer size minimization and optimization problem, Image and Signal Processing and Analysis (ISPA), p.44, 2013.

G. Barnes, A method for implementing lock-free shared-data structures, Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures, p.95, 1993.
DOI : 10.1145/165231.165265

URL : http://pubman.mpdl.mpg.de/pubman/item/escidoc%3A1834238/component/escidoc%3A2019499/MPI-I-94-120.pdf

B. Bhattacharya and S. S. Bhattacharyya, Parameterized dataflow modeling for DSP systems, IEEE Transactions on Signal Processing, vol.49, issue.10, p.40, 2001.
DOI : 10.1109/78.950795

P. Brucker, Scheduling algorithms, vol.3, p.127, 2007.

M. Bouchard, M. Čangalović, and A. Hertz, About equivalent interval colorings of weighted graphs, Discrete Applied Mathematics, vol.157, issue.17, p.45, 2009.

E. A. Brewer, F. T. Chong, L. T. Liu et al., Remote queues: Exposing message queues for optimization and atomicity, Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, vol.70, p.80, 1995.

G. Berry, SCADE: Synchronous design and validation of embedded control software, Next Generation Design and Verification Methodologies for Distributed Embedded Control Systems, p.49, 2007.

Berkeley, Latency Numbers Every Programmer Should Know, 2018.

V. Bebelis, P. Fradet, A. Girault, and B. Lavigueur, BPDF: A statically analyzable dataflow model with integer and boolean parameters, Embedded Software (EMSOFT), 2013 Proceedings of the International Conference on, p.40, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00923672

J. T. Buck and E. Lee, Scheduling dynamic dataflow graphs with bounded memory using the token flow model, Acoustics, Speech, and Signal Processing, vol.1, p.44, 1993.

Nvidia Developer Blog, Inside Volta: The World's Most Advanced Data Center GPU, 2017.

S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, Journal of VLSI signal processing systems for signal, image and video technology, vol.21, p.149, 1999.

S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, Software synthesis from dataflow graphs, vol.360, p.123, 2012.

D. Bonachea, GASNet specification, p.53, 2008.

D. Buntinas, D. K. Panda, and W. Gropp, NIC-based atomic remote memory operations in Myrinet/GM, Workshop on Novel Uses of System Area Networks (SAN-1). Citeseer, p.58, 2001.

J. Bhimani, J. Yang, Z. Yang, N. Mi, Q. Xu et al., Understanding performance of I/O intensive containerized applications for NVMe SSDs, Performance Computing and Communications Conference (IPCCC), p.93, 2016.

I. Cerrato, M. Annarumma, and F. Risso, Supporting fine-grained network functions through Intel DPDK, Software Defined Networks (EWSDN), p.97, 2014.

N. Cao, S. Chen, J. Shi, and D. Martinez, Physical symmetry and lattice symmetry in the lattice Boltzmann method, Physical Review E, vol.55, issue.1, p.166, 1997.

B. Chapman, T. Curtis, S. Pophale, S. Poole, J. Kuehn et al., Introducing OpenSHMEM: SHMEM for the PGAS community, Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, vol.28, p.53, 2010.

J. Ceng, J. Castrillón, W. Sheng, H. Scharwächter, R. Leupers et al., MAPS: an integrated framework for MPSoC application parallelization, Proceedings of the 45th annual Design Automation Conference, p.49, 2008.

L. Cudennec, P. Dubrulle, F. Galea, T. Goubier, and R. Sirdey, Generating code and memory buffers to reorganize data on many-core architectures, Procedia Computer Science, vol.29, p.151, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01071474

S. Ciricescu, R. Essick, B. Lucas, P. May, K. Moat et al., The reconfigurable streaming vector processor (RSVP™), Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, p.76, 2003.

J. E. Cooling and . Hughes, The emergence of rapid prototyping as a realtime software development tool, Software Engineering for Real Time Systems, p.45, 1989.

Y. Cheng, Autoscaling radix-4 FFT for TMS320C6000. Application report SPRA654, vol.174, p.175, 2000.

B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: portable shared memory parallel programming, vol.10, p.29, 2008.

J. Castrillon, R. Leupers, and G. Ascheid, MAPS: Mapping concurrent dataflow applications to heterogeneous MPSoCs, IEEE Transactions on Industrial Informatics, vol.9, issue.1, p.49, 2013.

A. Canziani, A. Paszke, and E. Culurciello, An analysis of deep neural network models for practical applications, p.184, 2016.

M. Chavarrías, F. Pescador, M. J. Garrido, E. Juárez, and C. Sanz, A multicore DSP HEVC decoder using an actor-based dataflow model and OpenMP, vol.61, p.120

F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini, PULP: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision, Journal of Signal Processing Systems, vol.84, issue.3, pp.339-354, 2016.

J. Chen, H. Sun, D. Woodruff, and Q. Zhang, Communication-optimal distributed clustering, Advances in Neural Information Processing Systems, p.124, 2016.

J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, vol.19, issue.90, p.174, 1965.

D. Cohen, T. Talpey, A. Kanevsky, U. Cummings, M. Krause et al., Remote direct memory access over the converged enhanced ethernet fabric: Evaluating the options, High Performance Interconnects, 2009. HOTI 2009. 17th IEEE Symposium on, p.52, 2009.

B. Dupont de Dinechin, R. Ayrignac, P. Beaucamps, P. Couvert, B. Ganne et al., A clustered manycore processor architecture for embedded and accelerated applications, High Performance Extreme Computing Conference (HPEC), vol.49, p.151, 2013.

B. Dupont de Dinechin, P. Guironnet de Massas, G. Lager, C. Léger, B. Orgogozo et al., A distributed run-time environment for the Kalray MPPA®-256 integrated manycore processor, Procedia Computer Science, vol.18, issue.31, p.70, 2013.

H. Deroui, K. Desnos, J. Nezan, and A. Munier-Kordon, Relaxed subgraph execution model for the throughput evaluation of IBSDF graphs, International Conference on Embedded Computer Systems: Architecture, Modeling and Simulation SAMOS, p.115, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01569593

H. Deroui, K. Desnos, J. Nezan, and A. Munier-Kordon, Throughput evaluation of DSP applications based on hierarchical dataflow models, Proceedings of the 50th International Symposium on Circuits and Systems, p.115, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01514641

B. Dupont de Dinechin, M. Rybczynska, and V. Ray, Atomic instruction having a local scope limited to an intermediate cache level, 2017.

K. Desnos, Memory Study and Dataflow Representations for Rapid Prototyping of Signal Processing Applications on MPSoCs, vol.40, p.207, 2014.
URL : https://hal.archives-ouvertes.fr/tel-01127297

E. De Greef, F. Catthoor, and H. De Man, Array placement for storage size reduction in embedded multimedia systems, Application-Specific Systems, Architectures and Processors, 1997. Proceedings., IEEE International Conference on, p.45, 1997.

J. J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, vol.15, pp.803-820, 2003.

P. Guironnet de Massas, Etude de méthodes et mécanismes pour un accès transparent et efficace aux données dans un système multiprocesseur sur puce, p.24, 2009.

GCC documentation, Built-in functions for atomic memory access, vol.61, p.65, 2007.

K. Desnos, M. Pelcat, J. Nezan, S. S. Bhattacharyya et al., PiMM: Parameterized and interfaced dataflow meta-model for MPSoCs runtime reconfiguration, Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on, vol.115, p.129, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00877492

K. Desnos, M. Pelcat, J. Nezan, and S. Aridhi, Buffer merging technique for minimizing memory footprints of synchronous dataflow specifications, Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, p.44, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01146340

K. Desnos, M. Pelcat, J. Nezan, and S. Aridhi, Distributed memory allocation technique for synchronous dataflow graphs, Signal Processing Systems (SiPS), vol.46, p.124, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01390486

U. Drepper, ELF handling for thread-local storage, p.22, 2003.

M. Dolle and M. Schlett, A cost-effective RISC/DSP microprocessor for embedded systems, IEEE Micro, vol.15, issue.5, pp.32-40, 1995.

Executable UML: A foundation for model-driven architecture, p.35, 2002.

K. Feind, Shared memory access (SHMEM) routines. Cray Research, p.53, 1995.

P. Fradet, A. Girault, and P. Poplavko, SPDF: A schedulable parametric data-flow MoC, Proceedings of the Conference on Design, Automation and Test in Europe, p.40, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00744376

M. J. Flynn, Some computer organizations and their effectiveness, IEEE Transactions on Computers, vol.100, issue.9, p.12, 1972.

Khronos OpenCL Working Group et al., The OpenCL specification version 1, vol.32, p.144, 2011.

Khronos Vision Working Group et al., The OpenVX specification v1, vol.1, 0209.

E. Gamma, Design patterns: elements of reusable object-oriented software, p.132, 1995.

R. Giduthuri, The OpenVX Safety Critical, 2017.

G. Kahn, The semantics of a simple language for parallel programming. Information Processing, vol.74, p.35, 1974.

A. V. Gerbessiotis and S. Lee, Remote memory access: A case for portable, efficient and library independent parallel programming, Scientific Programming, vol.12, issue.3, p.53, 2004.

A. Graillat, M. Moy, P. Raymond, and B. Dupont de Dinechin, Parallel code generation of synchronous programs for a manycore architecture, Design, Automation and Test in Europe, p.49, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01667594

D. Gelernter, A. Nicolau, and D. A. Padua, Languages and compilers for parallel computing, p.53, 1990.

S. Gorlatch, Send-receive considered harmful: Myths and realities of message passing, ACM Trans. Program. Lang. Syst, vol.26, issue.1, p.80, 2004.

R. Giduthuri and K. Pulli, OpenVX: a framework for accelerating computer vision, SIGGRAPH ASIA 2016 Courses, p.144, 2016.

T. Hoefler, J. Dinan, D. Buntinas, P. Balaji, B. Barrett et al., Leveraging MPI's one-sided communication interface for shared-memory programming. Recent advances in the message passing interface, vol.28, p.120, 2012.

J. Hascoët, B. Dupont de Dinechin, P. Guironnet de Massas, and M. Q. Ho, Asynchronous one-sided communications and synchronizations for a clustered manycore processor, Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia, vol.93, p.153, 2017.

J. Heulot, K. Desnos, J. Nezan, M. Pelcat, M. Raulet et al., An experimental toolchain based on high-level dataflow models of computation for heterogeneous mpsoc, Design and Architectures for Signal and Image Processing (DASIP), 2012 Conference on, p.46, 2012.

T. Hoefler, J. Dinan, R. Thakur, B. Barrett, P. Balaji et al., Remote memory access programming in MPI-3, ACM Transactions on Parallel Computing, vol.2, issue.2, p.120, 2015.

M. Heideman, D. Johnson, and C. Burrus, Gauss and the history of the fast Fourier transform, IEEE ASSP Magazine, vol.1, issue.4, p.174, 1984.

J. Hascoet, J. Nezan, A. Ensor, and B. Dupont de Dinechin, Implementation of a fast Fourier transform algorithm onto a manycore processor, Design and Architectures for Signal and Image Processing, vol.174, p.176, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01238833

J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, p.59, 2011.

J. Heulot, M. Pelcat, K. Desnos, J. Nezan, and S. Aridhi, SPIDER: A synchronous parameterized and interfaced dataflow-based RTOS for multicore DSPs, Education and Research Conference (EDERC), pp.167-171, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01067052

, IEEE, vol.40, p.188, 2014.

K. Z. Ibrahim, P. H. Hargrove, C. Iancu et al., An evaluation of one-sided and two-sided communication paradigms on relaxed-ordering interconnect, Parallel and Distributed Processing Symposium, vol.56, p.74, 2014.

National Supercomputing Center in Wuxi, Sunway TaihuLight Supercomputer System, 2018.

D. S. Johnson, Near-optimal bin packing algorithms, p.44, 1973.

Kalray, Deep Learning for High-Performance Embedded Applications, p.182

P. J. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel, TreadMarks: Distributed shared memory on standard workstations and operating systems, USENIX Winter, vol.1994, p.57, 1994.

H. Kim, I. E. Hajj, J. Stratton, S. Lumetta, and W. Hwu, Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures, Code Generation and Optimization (CGO), p.165, 2015.

J. L. Kelly Jr., C. Lochbaum, and V. A. Vyssotsky, A block diagram compiler, Bell System Technical Journal, vol.40, issue.3, p.35, 1961.

D. J. King and J. Launchbury, Structuring depth-first search algorithms in Haskell, Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, vol.43, p.149, 1995.

J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D. K. Panda, Optimizing mechanisms for latency tolerance in remote memory access communication on clusters, IEEE International Conference on Cluster Computing, vol.57, p.138, 2003.

F. B. Kjolstad and M. Snir, Ghost cell pattern, Proceedings of the 2010 Workshop on Parallel Programming Patterns, p.144, 2010.

B. Kågström and C. Van Loan, GEMM-based level-3 BLAS, p.114, 1991.

Y. Kwok, High-performance algorithms for compile-time scheduling of parallel processors, vol.42, p.134, 1997.

D. Lea and W. Gloger, A memory allocator, p.44, 1996.

E. A. Lee and S. Ha, Scheduling strategies for multiprocessor real-time DSP, Global Telecommunications Conference and Exhibition 'Communications Technology for the 1990s and Beyond' (GLOBECOM), p.42, 1989.

J. Liu, W. Jiang, P. Wyckoff, D. K. Panda et al., Design and implementation of MPICH2 over InfiniBand with RDMA support, Parallel and Distributed Processing Symposium, vol.56, p.81, 2004.

E. A. Lee and D. G. Messerschmitt, Synchronous data flow. Proceedings of the IEEE, vol.75, p.142, 1987.

E. Ladan-Mozes and N. Shavit, An optimistic approach to lock-free FIFO queues, International Symposium on Distributed Computing, p.63, 2004.

Y.-T. S. Li, S. Malik, and A. Wolfe, Performance estimation of embedded software with instruction cache modeling, ACM Transactions on Design Automation of Electronic Systems (TODAES), vol.4, issue.3, p.15, 1999.

E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, NVIDIA Tesla: A unified graphics and computing architecture, IEEE Micro, vol.28, issue.2, 2008.

E. A. Lee and T. M. Parks, Dataflow process networks. Proceedings of the IEEE, vol.83, p.35, 1995.

T. Lepley, P. Paulin, and E. Flamand, A novel compilation approach for image processing graphs on a many-core platform with explicitly managed memory, Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, p.55, 2013.

P. K. Murthy and S. S. Bhattacharyya, Shared memory implementations of synchronous dataflow specifications, Design, Automation and Test in Europe Conference and Exhibition 2000. Proceedings, p.45, 2000.

J. Mellor-Crummey, L. Adhianto, W. N. Scherer III, and G. Jin, A New Vision for Coarray Fortran, Proc. of the Third Conference on Partitioned Global Address Space Programming Models, PGAS '09, vol.5, p.53, 2009.

M. Maurer, C. Gerdes, B. Lenz, and H. Winner, Autonomous driving, 2016.

K. Matsuzaki, S. Hata, J. Hamano, Y. Kurashima, and M. Torii, Petri-net structured sequence control language with grafcet-like graphical expression for programmable controllers, Proc. IECON'85, p.35, 1985.

K. Mattila, J. Hyväluoma, J. Timonen, and T. Rossi, Comparison of implementations of the lattice-Boltzmann method, Computers & Mathematics with Applications, vol.55, issue.7, p.170, 2008.

M. M. Michael, High performance dynamic lock-free hash tables and list-based sets, Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures, p.63, 2002.

J. Mistry, M. Naylor, and J. Woodcock, Adapting FreeRTOS for multicores: An experience report. Software: Practice and Experience, vol.44, p.21, 2014.

H. Massalin and C. Pu, A lock-free multiprocessor OS kernel, ACM SIGOPS Operating Systems Review, vol.26, issue.2, p.113, 1992.

M. Masmano, I. Ripoll, S. Peiró, and A. Crespo, XtratuM for LEON3: an open source hypervisor for high integrity systems, European Conference on Embedded Real Time Software and Systems, vol.2, p.24, 2010.
URL : https://hal.archives-ouvertes.fr/hal-02267841

J. M. Kevin, M. Martin, M. J. Rizk, J. Sepulveda, and . Diguet, Notifying memories: a case-study on data-flow applications with noc interfaces implementation, Proceedings of the 53rd Annual Design Automation Conference, vol.58, p.77, 2016.

S. Mcintosh, -. Smith, M. Boulton, D. Curran, and J. Price, On the performance portability of structured grid codes on many-core computer architectures, Supercomputing, p.173, 2014.

J. Nieplocha and B. Carpenter, ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems, Parallel and Distributed Processing, p.98

R. W. Numrich and J. Reid, Co-array Fortran for parallel programming, SIGPLAN Fortran Forum, vol.17, issue.2, p.53, 1998.

J. Nieplocha, V. Tipparaju, M. Krishnan, and D. Panda, High performance remote memory access communication: The ARMCI approach, International Journal of High Performance Computing Applications, vol.20, issue.2, p.69, 2006.

Nvidia, Nvidia Tegra X1, 2015.

Nvidia, Nvidia Drive PX 2, 2017.

Nvidia, GPU-Accelerated Libraries for Computing, 2018.

J. K. Ousterhout, An embeddable command language, Citeseer, p.81, 1989.

T. Ogasawara, An algorithm with constant execution time for dynamic storage allocation, Real-Time Computing Systems and Applications, 1995. Proceedings., Second International Workshop on, p.44, 1995.

A. Olofsson, Epiphany-V: A 1024 processor 64-bit RISC system-on-chip, vol.6, p.51, 2016.

OpenMP ARB, 2013.

J. S. Ostroff, Abstraction and composition of discrete real-time systems, Proc. of CASE, vol.95, p.115, 1995.

C. Obrecht, B. Tourancheau, and F. Kuznik, Performance Evaluation of an OpenCL Implementation of the Lattice Boltzmann Method on the Intel Xeon Phi, Parallel Processing Letters, vol.25, issue.03, p.167, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01286306

J. Papin, A Scheduling and Partitioning Model for Stencil-based Applications on Many-Core Devices, p.42, 2016.

M. Pelcat, S. Aridhi, J. Piat, and J. Nezan, Physical Layer Multi-Core Prototyping: A Dataflow-Based Approach for LTE eNodeB, p.115, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00739957

J. Piat, S. S. Bhattacharyya, and M. Raulet, Interface-based hierarchy for synchronous data-flow graphs, Signal Processing Systems, vol.115, p.116, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00440478

K. Popovici and A. Jerraya, Hardware abstraction layer, Hardware-dependent Software, vol.11, p.24, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00379166

M. Pelcat, P. Menuet, S. Aridhi, and J. Nezan, Scalable compile-time scheduler for multi-core architectures, Proceedings of the Conference on Design, Automation and Test in Europe, p.46, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00429393

M. Pelcat, J. F. Nezan, and S. Aridhi, Adaptive multicore scheduling for the LTE uplink, NASA/ESA Conference on Adaptive Hardware and Systems, p.42, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00488576

R. F. Rashid, Designs for parallel architectures, Unix Review, vol.5, issue.4, p.16, 1987.

R. Mellanox and . Vs, , 1952.

R. M. Russell, The CRAY-1 computer system, Communications of the ACM, vol.21, issue.1, p.13, 1978.

E. Rainey, J. Villarreal, G. Dedeoglu, K. Pulli, T. Lepley et al., Addressing system-level optimization with OpenVX graphs, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, p.144, 2014.

K. Singh, M. Alam, and S. Sharma, A survey of static scheduling algorithm for distributed computing system, International Journal of Computer Applications, vol.129, issue.2

S. Saidi, R. Ernst, S. Uhrig, H. Theiling, and B. Dupont de Dinechin, The shift to multicores in real-time and safety-critical systems, 2015 International Conference on Hardware/Software Codesign and System Synthesis, vol.17, p.51, 2015.

B. Spinean and G. Gaydadjiev, Implementation study of FFT on multi-lane vector processors, Digital System Design (DSD), 2012 15th Euromicro Conference on, p.175, 2012.

S. Stuijk, M. Geilen, and T. Basten, Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs, Design Automation Conference, p.44, 2006.

T. Shanley, InfiniBand Network Architecture, Addison-Wesley Professional, vol.69, p.74, 2003.

D. J. Sorin, M. D. Hill, and D. A. Wood, A primer on memory consistency and cache coherence, Synthesis Lectures on Computer Architecture, vol.6, issue.3, p.60, 2011.

E. Stotzer, A. Jayaraj, M. Ali, A. Friedmann, G. Mitra et al., OpenMP on the low-power TI Keystone II ARM/DSP system-on-chip, International Workshop on OpenMP, pp.114-127, 2013.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions, Proceedings of the IEEE conference on computer vision and pattern recognition, p.183, 2015.

S. Loosemore, with R. M. Stallman, R. McGrath, A. Oram, and U. Drepper, The GNU C Library Reference Manual, vol.29, p.61, 2018.

A. Singh, M. Shafique, A. Kumar, and J. Henkel, Mapping on multi/many-core systems: survey of current and emerging trends, Proceedings of the 50th Annual Design Automation Conference, p.47, 2013.

S. Succi, The lattice Boltzmann equation: for fluid dynamics and beyond, p.166, 2001.

H. Sutter, The free lunch is over: A fundamental turn toward concurrency in software, Dr. Dobb's Journal, vol.30, pp.202-210, 2005.

R. Tarjan, Depth-first search and linear graph algorithms, SIAM journal on computing, vol.1, issue.2, p.43, 1972.

S. Tripakis, D. Bui, M. Geilen, B. Rodiers, and E. Lee, Compositionality in synchronous data flow: Modular code generation from hierarchical SDF graphs, ACM Transactions on Embedded Computing Systems (TECS), vol.12, issue.3, p.39, 2013.

B. D. Theelen, M. C. W. Geilen, T. Basten, J. P. M. Voeten et al., A scenario-aware data flow model for combined long-run average and worst-case performance analysis, Formal Methods and Models for Co-Design, MEMOCODE'06. Proceedings. Fourth ACM and IEEE International Conference on, p.40, 2006.

G. Tagliavini, G. Haugou, and L. Benini, Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators, Design and Architectures for Signal and Image Processing, p.144, 2014.

G. Tagliavini, G. Haugou, A. Marongiu, and L. Benini, ADRENALINE: an OpenVX environment to optimize embedded vision applications on many-core accelerators, Embedded Multicore/Many-core Systems-on-Chip (MCSoC), p.144, 2015.

Texas Instruments, TMS320C6678, 2013.

L. Torvalds, Linux: a portable operating system, Master's thesis, University of Helsinki, 1997.

L. Torvalds and D. Diamond, Just for Fun: The Story of an Accidental Revolutionary, 2001.

K. Vaidyanathan, L. Chai, W. Huang, and D. Panda, Efficient asynchronous memory copy operations on multi-core systems and I/OAT, Cluster Computing, IEEE International Conference on, p.57, 2007.

A. Varghese, B. Edwards, G. Mitra, and A. Rendell, Programming the Adapteva Epiphany 64-core network-on-chip coprocessor, Parallel & Distributed Processing Symposium Workshops (IPDPSW), p.51, 2014.

M. Verma, L. Wehmeyer, and P. Marwedel, Dynamic overlay of scratchpad memory for energy minimization, Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, p.59, 2004.
DOI : 10.1145/1016720.1016748

N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. Devito et al., Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, p.144, 2018.

F. G. Van Zee and R. A. van de Geijn, BLIS: A framework for rapidly instantiating BLAS functionality, ACM Transactions on Mathematical Software (TOMS), vol.41, issue.3, p.164, 2015.

O. Wing, Ladder network analysis by signal-flow graph-application to analog computer programming, IRE Transactions on Circuit Theory, vol.3, issue.4, p.35, 1956.
DOI : 10.1109/tct.1956.1086331

P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles, Dynamic storage allocation: A survey and critical review, Memory Management, p.22, 1995.

W. A. Wulf and S. A. McKee, Hitting the memory wall: implications of the obvious, ACM SIGARCH Computer Architecture News, vol.23, issue.1, p.21, 1995.

M. Wahib and N. Maruyama, Scalable kernel fusion for memory-bound GPU applications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p.148, 2014.
DOI : 10.1109/sc.2014.21

W. Wolf, What is embedded computing?, Computer, vol.35, issue.1, pp.136-137, 2002.
DOI : 10.1109/2.976929

S. Wienke, P. Springer, C. Terboven, and D. an Mey, OpenACC: first experiences with real-world applications, European Conference on Parallel Processing, vol.34, p.145, 2012.
DOI : 10.1007/978-3-642-32820-6_85

S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM, vol.52, issue.4, p.172, 2009.
DOI : 10.2172/1407078

URL : http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-134.pdf

C. Zinner and W. Kubinger, ROS-DMA: a DMA double buffering method for embedded image processing with resource optimized slicing, Real-Time and Embedded Technology and Applications Symposium, p.107, 2006.