1 Single Multicore Processor Intel Core i7 Q720 (4x Cores, 8x Threads, 4x L1 Cache 32KB, 4x L2 Cache 256KB, 1x L3 Cache 6MB, 1.6 GHz). This platform is equipped with a GPU ,
The platform has 32 GB of RAM (ECC Registered at 1600 MHz ,
Most of the Nodes [...] "Intel Xeon X5560 Quad Core at 2.8 GHz". The remaining 38 Nodes are bi-processor boards with 2 x "Intel Xeon X5677 Quad Core at 3.46 GHz". Therefore the CAPARMOR Supercomputer is [...]
Bibliography
MALLBA: A Library of Skeletons for Combinatorial Optimisation, Proceedings of the 8th International Euro-Par Conference on Parallel Processing, Euro-Par '02, p.927 ,
LAPACK Users' Guide, 1992. ,
A view of the parallel computing landscape, Communications of the ACM, vol.52, issue.10, pp.56-67, 2009. ,
DOI : 10.1145/1562764.1562783
The Design of OpenMP Tasks, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.3, pp.404-418, 2009. ,
The cost of security in skeletal systems, 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07), pp.213-220, 2007. ,
DOI : 10.1109/PDP.2007.79
MUSKEL: An Expandable Skeleton Environment, 2007. ,
Autonomic management of non-functional concerns in distributed & parallel application programming, 2009 IEEE International Symposium on Parallel & Distributed Processing, pp.1-12, 2009. ,
DOI : 10.1109/IPDPS.2009.5161034
FastFlow: high-level and efficient streaming on multi-core, Programming Multi-core and Many-core Computing Systems, ser. Parallel and Distributed Computing, S. Pllana, p.13, 2012. ,
An advanced environment supporting structured parallel programming in Java, Future Generation Computer Systems, vol.19, issue.5, pp.611-626, 2003. ,
DOI : 10.1016/S0167-739X(02)00172-3
Using Skeletons in a Java-Based Grid System ,
DOI : 10.1007/978-3-540-45209-6_103
Adapting Java RMI for grid computing, Future Generation Computer Systems, vol.21, issue.5, pp.699-707, 2005. ,
DOI : 10.1016/j.future.2004.05.010
The OpenCL specification version 1, 2011. ,
AMD64 Architecture Programmer's Manual, 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions, 2009. ,
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurr. Comput.: Pract. Exper., vol.23, issue.2, pp.187-198, 2011. ,
Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model, Compiler Construction, pp.132-146 ,
An Environment for Structured Parallel Programming, Advances in High Performance Computing, pp.219-234, 1997. ,
Flexible Skeletal Programming with eSkel, Proceedings of the 11th International Euro-Par Conference on Parallel Processing, Euro-Par'05, pp.761-770, 2005. ,
Parallel FPGA-based all-pairs shortest-paths in a directed graph, Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'06), 2006. ,
A survey of multicore processors, IEEE Signal Processing Magazine, vol.26, issue.6, pp.26-37, 2009. ,
DOI : 10.1109/MSP.2009.934110
P3L: A structured high level programming language and its structured support, Concurrency: Practice and Experience, pp.225-255, 1995. ,
SkIE: A heterogeneous environment for HPC applications, Parallel Computing, vol.25, issue.13-14, pp.1827-1852, 1999. ,
DOI : 10.1016/S0167-8191(99)00072-1
Performance analysis of parallelizing compilers on the Perfect Benchmarks programs, IEEE Transactions on Parallel and Distributed Systems, vol.3, issue.6, pp.643-656, 1992. ,
DOI : 10.1109/71.180621
Polaris: Improving the Effectiveness of Parallelizing Compilers, Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing, LCPC '94, pp.141-154, 1995. ,
VFC: The Vienna Fortran Compiler, Scientific Programming, vol.7, issue.1, pp.67-81, 1999. ,
DOI : 10.1155/1999/304639
Implementation of a Portable Nested Data-parallel Language, SIGPLAN Not, vol.28, issue.7, pp.102-111, 1993. ,
Parallelizing dense and banded linear algebra libraries using SMPSs, Concurrency and Computation: Practice and Experience, pp.2438-2456, 2009. ,
A Practical Automatic Polyhedral Parallelizer and Locality Optimizer, pp.101-113, 2008. ,
DOI : 10.1145/1375581.1375595
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.151.5126
Skil: an imperative language with algorithmic skeletons for efficient distributed programming, Proceedings of 5th IEEE International Symposium on High Performance Distributed Computing HPDC-96, p.243, 1996. ,
DOI : 10.1109/HPDC.1996.546194
The PARSEC benchmark suite, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, 2008. ,
DOI : 10.1145/1454115.1454128
Vector Models for Data-parallel Computing, 1990. ,
Programming parallel algorithms, Communications of the ACM, vol.39, issue.3, pp.85-97, 1996. ,
DOI : 10.1145/227234.227246
The Eden coordination model for distributed memory systems, Proceedings Second International Workshop on High-Level Parallel Programming Models and Supportive Environments, p.120, 1997. ,
DOI : 10.1109/HIPS.1997.582964
Pattern-based parallel programming, Proceedings of the International Conference on Parallel Programming, pp.257-265, 2002. ,
Hoard: A Scalable Memory Allocator for Multithreaded Applications, SIGOPS Oper. Syst. Rev, vol.34, issue.5, pp.117-128, 2000. ,
Automatic mapping of nested loops to FPGAs, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'07), 2007. ,
PLuTo: A Practical and Fully Automatic Polyhedral Parallelizer and Locality Optimizer, 2007. ,
Readings in Computer Vision: Issues, Problems, Principles, and Paradigms. chapter A Computational Approach to Edge Detection, pp.184-203, 1987. ,
Parallel Programmability and the Chapel Language, Int. J. High Perform. Comput. Appl, vol.21, issue.3, pp.291-312, 2007. ,
SMP Superscalar (SMPSs) User's Manual Version 1, 2011. ,
Adaptive Cache Aware Bitier Work-Stealing in Multisocket Multicore Architectures, IEEE Transactions on Parallel and Distributed Systems, vol.24, issue.12, pp.2334-2343, 2013. ,
DOI : 10.1109/TPDS.2012.322
X10: An Object-oriented Approach to Non-uniform Cluster Computing, SIGPLAN Not, vol.40, issue.10, pp.519-538, 2005. ,
Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing, 2008 37th International Conference on Parallel Processing, pp.536-545, 2008. ,
DOI : 10.1109/ICPP.2008.88
Fine Tuning Algorithmic Skeletons, Proceedings of the 13th International Euro-Par Conference on Parallel Processing, Euro-Par'07, pp.72-81, 2007. ,
DOI : 10.1007/978-3-540-74466-5_9
A transparent non-invasive file data model for algorithmic skeletons, IEEE International Symposium on Parallel and Distributed Processing, pp.1-10, 2008. ,
Parallel Programming with Microsoft Visual C++: Design Patterns for Decomposition and Coordination on Multicore Architectures, 2011. ,
Domain Decomposition and Skeleton Programming with OCamlP3l, Parallel Comput, vol.32, issue.7, pp.539-550, 2006. ,
Algorithmic Skeletons, 1991. ,
DOI : 10.1007/978-1-4471-0841-2_13
Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming, Parallel Comput, vol.30, issue.3, pp.389-406, 2004. ,
C Language Extensions for Hybrid CPU/GPU Programming with StarPU, 2013. ,
The Münster Skeleton Library Muesli: A comprehensive overview, 2009. ,
QoS in Parallel Programming Through Application Managers, Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP '05, pp.282-289, 2005. ,
Cetus: A Source-to-Source Compiler Infrastructure for Multicores, Computer, vol.42, issue.12, pp.36-42, 2009. ,
HOC-SA: a grid service architecture for higher-order components, IEEE International Conference on Services Computing (SCC 2004), pp.288-294, 2004. ,
DOI : 10.1109/SCC.2004.1358017
LLC: A Parallel Skeletal Language, Proc. of the Second International Workshop on High Level Parallel Programming and Applications, pp.77-88, 2003. ,
DOI : 10.1142/S0129626403001409
Parallel Skeletons for Structured Composition, Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '95, pp.19-28, 1995. ,
Classics in Software Engineering. chapter Go to Statement Considered Harmful, pp.27-33, 1979. ,
OpenMP: an industry standard API for shared-memory programming, IEEE Computational Science and Engineering, vol.5, issue.1, pp.46-55, 1998. ,
DOI : 10.1109/99.660313
Scheduling and Automatic Parallelization, 2000. ,
DOI : 10.1007/978-1-4612-1362-8
URL : https://hal.archives-ouvertes.fr/hal-00856645
SKElib: Parallel Programming with Skeletons in C, Proceedings from the 6th International Euro-Par Conference on Parallel Processing, Euro-Par '00, pp.1175-1184, 2000. ,
DOI : 10.1007/3-540-44520-X_166
Abstract Machine Models for Highly Parallel Computers, chapter Building Parallel Applications Without Programming, pp.140-154, 1995. ,
Generic Programming for Portable SIMDization, Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pp.431-432 ,
Exploiting Multimedia Extensions in C++: A Portable Approach, Computing in Science and Engineering, vol.14, issue.5, pp.72-77, 2012. ,
SkePU, Proceedings of the fourth international workshop on High-level parallel programming and applications, HLPP '10, 2010. ,
DOI : 10.1145/1863482.1863487
A Scalable Concurrent malloc(3) Implementation for FreeBSD ,
The Design and Implementation of FFTW3, Special issue on "Program Generation, Optimization, and Platform Adaptation", pp.216-231, 2005. ,
DOI : 10.1109/JPROC.2004.840301
The Implementation of the Cilk-5 Multithreaded Language, SIGPLAN Not, vol.33, issue.5, pp.212-223, 1998. ,
JaSkel: a Java skeleton-based framework for structured cluster and grid computing, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), pp.301-304, 2006. ,
DOI : 10.1109/CCGRID.2006.65
S-Net for multi-memory multicores, Proceedings of the 5th ACM SIGPLAN workshop on Declarative aspects of multicore programming, DAMP '10, pp.25-34, 2010. ,
DOI : 10.1145/1708046.1708054
TCMalloc : Thread-Caching Malloc ,
Toward a toolchain for pipeline parallel programming on CMPs, Workshop on Software Tools for Multi-Core Systems, 2007. ,
Shared Memory Multiprocessor Support for SAC, Selected Papers from the 10th International Workshop on Implementation of Functional Languages, IFL '98, pp.38-53, 1999. ,
DOI : 10.1007/3-540-48515-5_3
Shared Memory Multiprocessor Support for Functional Array Processing in SAC, J. Funct. Program, vol.15, issue.3, pp.353-401, 2005. ,
Self-adaptive skeletal task farm for computational grids, Parallel Computing, vol.32, issue.7-8, pp.479-490, 2006. ,
DOI : 10.1016/j.parco.2006.07.002
Adaptive structured parallelism for computational grids, Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, pp.140-141, 2007. ,
DOI : 10.1145/1229428.1229456
An adaptive parallel pipeline pattern for grids, 2008 IEEE International Symposium on Parallel and Distributed Processing, pp.1-11, 2008. ,
DOI : 10.1109/IPDPS.2008.4536264
A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers, Software: Practice and Experience, vol.21, issue.6 ,
DOI : 10.1002/spe.1026
Code Generation in Action, 2003. ,
HDC: A Higher-Order Language for Divide-and-Conquer, 2000. ,
Stream compaction for deferred shading, Proceedings of the 1st ACM conference on High Performance Graphics, HPG '09, pp.173-180, 2009. ,
DOI : 10.1145/1572769.1572797
Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips, Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, pp.57-64, 2014. ,
Development and Implementation of an Interactive Parallelization Assistance Tool for OpenMP: iPat/OMP, IEICE Transactions on Information and Systems, vol.89, issue.2 ,
DOI : 10.1093/ietisy/e89-d.2.399
Automatic Parallelization with Intel Compilers, https://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers ,
Intel Core i7-720QM, http://ark.intel.com/products ,
OSL: Optimized Bulk Synchronous Parallel Skeletons on Distributed Arrays, Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies, APPT '09, pp.436-451, 2009. ,
DOI : 10.1145/79173.79181
URL : https://hal.archives-ouvertes.fr/inria-00452523
A study of possible optimizations for the task scheduler "QUARK" on the shared memory architecture, 2013. ,
Trends in multicore DSP platforms, IEEE Signal Processing Magazine, vol.26, issue.6, pp.38-49, 2009. ,
Multicore software technologies, IEEE Signal Processing Magazine, vol.26, issue.6, pp.80-89, 2009. ,
MHPM: Multi-Scale Hybrid Programming Model: A Flexible Parallelization Methodology, 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, pp.71-80 ,
DOI : 10.1109/HPCC.2012.20
Implementation Skeletons in Eden: Low-Effort Parallel Programming, Selected Papers from the 12th International Workshop on Implementation of Functional Languages, IFL '00, pp.71-88, 2001. ,
DOI : 10.1007/3-540-45361-X_5
Multithreading in the PLASMA Library, Handbook of Multi and Many-Core Processing: Architecture, Algorithms, Programming, and Applications. Chapman and Hall/CRC, 2014. ,
Complete x86/x64 JIT and Remote Assembler for C++, https ,
Metaprogramming in C++, www.cs.tut ,
The parallel execution of DO loops, Communications of the ACM, vol.17, issue.2, pp.83-93, 1974. ,
DOI : 10.1145/360827.360844
The Problem with Threads, Computer, vol.39, issue.5, pp.33-42, 2006. ,
DOI : 10.1109/MC.2006.180
The Cilk++ Concurrency Platform, Proceedings of the 46th Annual Design Automation Conference, DAC '09, pp.522-527, 2009. ,
Parallel VSIPL++: An Open Standard Software Library for High-Performance Parallel Signal Processing, Proceedings of the IEEE, pp.313-330, 2005. ,
DOI : 10.1109/JPROC.2004.840303
Control Flow Emulation on Tiled SIMD Architectures, Proceedings of the Joint European Conferences on Theory and Practice of Software 17th International Conference on Compiler Construction, CC'08/ETAPS'08, pp.100-115, 2008. ,
DOI : 10.1007/978-3-540-78791-4_7
Ortega-Mallén, and R. Peña Marí. Parallel Functional Programming in Eden, J. Funct. Program, vol.15, issue.3, pp.431-475, 2005. ,
Skandium: Multi-core Programming with Algorithmic Skeletons, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.289-296, 2010. ,
DOI : 10.1109/PDP.2010.26
The Design of a Task Parallel Library, Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '09, pp.227-242, 2009. ,
Understanding Digital Signal Processing. Prentice-Hall accounting series, 2010. ,
An Engine, Not a Camera: How Financial Models Shape Markets. Inside Technology, 2008. ,
DOI : 10.7551/mitpress/9780262134606.001.0001
Structured Parallel Programming with Deterministic Patterns, Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, HotPar'10, p.55 ,
Particle-based Fluid Simulation for Interactive Applications, Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '03, pp.154-159, 2003. ,
Processor Information, http://msdn.microsoft.com/en-us/library/windows/desktop/ms683194 ,
[Micb] Microsoft. Task Parallel Library ,
A library of constructive skeletons for sequential style of parallel programming, Proceedings of the 1st international conference on Scalable information systems, InfoScale '06, 2006. ,
DOI : 10.1145/1146847.1146860
Structured Parallel Programming: Patterns for Efficient Computation, 2012. ,
Patterns for Parallel Programming, 2004. ,
Rethinking the pipeline as object-oriented states with transformations, Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pp.12-21, 2004. ,
DOI : 10.1109/HIPS.2004.1299186
Performance Evaluation of GPUs Using the RapidMind Development Platform, Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, 2006. ,
QUAD - A Memory Access Pattern Analyser, Proceedings of the 6th International Conference on Reconfigurable Computing: Architectures, Tools and Applications, ARC'10, pp.269-281, 2010. ,
DOI : 10.1007/978-3-642-12133-3_25
An Infrastructure for Video-Augmented Environments ,
Fluid Mechanics, 1992. ,
DOI : 10.1017/CBO9781139172561
A compact fiducial for affine augmented reality, Proceedings of the 2005 IEEE International Conference on Visual Information Engineering, VIE'05, pp.347-352, 2005. ,
MMX technology extension to the Intel architecture, IEEE Micro, vol.16, issue.4, pp.42-50, 1996. ,
DOI : 10.1109/40.526924
Automatic Hybrid MPI+OpenMP Code Generation with llc, Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.185-195, 2009. ,
DOI : 10.1007/978-3-642-03770-2_25
Composable Parallel Patterns with Intel Cilk Plus, Computing in Science & Engineering, vol.15, issue.2, pp.66-71, 2013. ,
DOI : 10.1109/MCSE.2013.21
Implementing streaming SIMD extensions on the Pentium III processor, IEEE Micro, vol.20, issue.4, pp.47-57, 2000. ,
DOI : 10.1109/40.865866
The Boost C++ Libraries, 2011. ,
Skeletons for parallel image processing: an overview of the SKIPPER project, Parallel Computing, vol.28, issue.12, pp.1685-1708, 2002. ,
DOI : 10.1016/S0167-8191(02)00189-8
Introspective C++, 2004. ,
Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling, IEEE Trans. Parallel Distrib. Syst, vol.4, issue.2, pp.131-143, 1993. ,
The Standard Template Library, WG21/N0482, ISO Programming Language C++ Project, 1994. ,
Modern Processor Design: Fundamentals of Superscalar Processors, 2002. ,
Models and languages for parallel computation, ACM Computing Surveys, vol.30, issue.2, pp.123-169, 1998. ,
DOI : 10.1145/280277.280278
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.1801
Real-Time Fluid Dynamics for Games, 2003. ,
A Unified Model for Multicore Architectures, Proceedings of the 1st International Forum on Next-generation Multicore/Manycore Technologies, IFMT '08, p.12, 2008. ,
Thread Clustering: Sharing-aware Scheduling on SMP-CMP-SMT Multiprocessors, SIGOPS Oper. Syst. Rev, vol.41, issue.3, pp.47-58, 2007. ,
Using Generative Design Patterns to Generate Parallel Code for a Distributed Memory Environment, SIGPLAN Not, vol.38, issue.10, pp.203-215, 2003. ,
Evaluation of UPC programmability using classroom studies, Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, PGAS '09, pp.110-117, 2009. ,
DOI : 10.1145/1809961.1809975
The Programming Model of ASSIST, an Environment for Parallel and Distributed Portable Applications, Parallel Comput, vol.28, issue.12, pp.1709-1732, 2002. ,
SESAM/Par4All: A Tool for Joint Exploration of MPSoC Architectures and Dynamic Dataflow Code Generation, Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, RAPIDO '12, pp.9-16, 2012. ,
Dynamic storage allocation: A survey and critical review, Proceedings of the International Workshop on Memory Management, IWMM '95, pp.1-116, 1995. ,
DOI : 10.1007/3-540-60368-9_19
Multiprocessor system-on-chip technology, IEEE Signal Processing Magazine, vol.26, issue.6, 2009. ,
DOI : 10.1109/MSP.2009.934138
Dynamic Task Execution on Shared and Distributed Memory Architectures, 2012. ,
Cache-aware task scheduling on multi-core architecture, International Symposium on VLSI Design Automation and Test (VLSI-DAT), pp.139-142, 2010. ,
execution time on a 16-thread SMP platform with two Intel Xeon E5620 processors at 2.4 GHz
Line count of the sequential version and the parallel versions using XPU (vectorized)
Line count of the sequential version and the vectorized parallel versions using XPU (Vectorized, OpenMP+SSE,
generates unnecessary idle times when executing certain Task Graphs (DAG)
8.4 The super-scalar execution model used by FATMA, SMPSs or QUARK executes the tasks asynchronously and uses an event-based peer-to-peer synchronization model between dependent tasks. This allows FATMA to eliminate unnecessary idle times when executing Task Graphs
SMPSs implementations of the tiled Cholesky factorization on an 8-thread Intel Core i7 Q720 processor
Static Scheduling) implementations of the tiled dgesv on an SMP platform with 2 x Intel Xeon E5620 at 2.4 GHz (16 Hardware Threads)