A. Almási, C. Archer, J. Castaños, C. Erway, P. Heidelberger et al., Implementing MPI on the BlueGene/L Supercomputer, Euro-Par 2004 Parallel Processing, pp.833-845, 2004.
DOI : 10.1007/978-3-540-27866-5_112

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs, GPU Computing Gems, p.31, 2010.
DOI : 10.1016/B978-0-12-385963-1.00034-4

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov, LU factorization for accelerator-based systems, 9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11), p.31, 2011.

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, p.31, 2011.
DOI : 10.1109/IPDPS.2011.90

URL : https://hal.archives-ouvertes.fr/inria-00547614

S. Alam, R. Barrett, M. Bast, M. R. Fahey, J. Kuehn et al., Early evaluation of IBM BlueGene/P, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-2312, 2008.
DOI : 10.1109/SC.2008.5214725

E. Ayguadé, R. M. Badia, F. D. Igual et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, pp.851-862, 2009.
DOI : 10.1109/TPDS.2003.1214317

E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin et al., The Design of OpenMP Tasks, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.3, pp.404-418, 2009.
DOI : 10.1109/TPDS.2008.105

T. E. Anderson, D. E. Culler, and D. Patterson, A case for NOW (Networks of Workstations), IEEE Micro, vol.15, issue.1, pp.54-64, 1995.

S. V. Adve and K. Gharachorloo, Shared memory consistency models: a tutorial, Computer, vol.29, issue.12, pp.66-76, 1996.
DOI : 10.1109/2.546611

J. Antony, P. P. Janes, and A. P. Rendell, Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport, Proceedings of the International Conference on High Performance Computing (HiPC), p.29, 2006.
DOI : 10.1007/11945918_35

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.7975

Advanced Micro Devices, Inc., HyperTransport Technology I/O Link: A High-Bandwidth I/O Architecture, p.29, 2001.

D. Akihiro, W. Toshihiko, and N. Hideki, Packaging technology for the NEC SX-3/SX-X Supercomputer, 40th Conference Proceedings on Electronic Components and Technology, pp.525-533, 1990.
DOI : 10.1109/ECTC.1990.122238

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, pp.187-198, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter et al., The NAS Parallel Benchmarks, International Journal of Supercomputer Applications, vol.5, issue.3, pp.63-73, 1991.
DOI : 10.1177/109434209100500306

P. Balaji, D. Buntinas, D. Goodell, W. Gropp, and R. Thakur, Fine-grained multithreading support for hybrid threaded MPI programming, IJHPCA, vol.24, pp.49-57, 2010.

N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz et al., Myrinet: a gigabit-per-second local area network, IEEE Micro, vol.15, issue.1, pp.29-36, 1995.
DOI : 10.1109/40.342015

S. Biswas, B. R. de Supinski, M. Schulz, D. Franklin, T. Sherwood et al., Exploiting data similarity to reduce memory footprints, IPDPS, pp.152-163, 2011.

J. M. Bull, J. P. Enright, and N. Ameer, A Microbenchmark Suite for Mixed-Mode OpenMP/MPI, IWOMP'09, pp.118-131, 2009.
DOI : 10.1155/2001/450503

J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel, A multithreaded PowerPC processor for commercial servers, IBM Journal of Research and Development, vol.44, issue.6, pp.885-898, 2000.
DOI : 10.1147/rd.446.0885

D. J. Berg, Java threads: a white paper, p.37, 1996.

F. Broquedis, N. Furmento, B. Goglin, P.-A. Wacrenier, and R. Namyst, ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures, International Journal of Parallel Programming, vol.62, issue.5-6, pp.418-439, 2010.
DOI : 10.1007/s10766-010-0136-3

URL : https://hal.archives-ouvertes.fr/inria-00496295

D. Buntinas, G. Mercier, and W. Gropp, Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem, Parallel Computing, vol.33, issue.9, pp.634-644, 2007.
DOI : 10.1016/j.parco.2007.06.003

URL : https://hal.archives-ouvertes.fr/hal-00344327

P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta, Exploiting Locality on the Cell/B.E. through Bypassing, Proceedings of the 9th International Workshop on Embedded Computer Systems : Architectures, Modeling, and Simulation, SAMOS '09, pp.318-328, 2009.
DOI : 10.1147/rd.515.0593

F. Broquedis, De l'exécution d'applications scientifiques OpenMP sur architectures hiérarchiques, p.71, 2010.

F. Cappello, E. Caron, M. Dayde, F. Desprez, Y. Jegou et al., Grid'5000: a large scale and highly reconfigurable grid experimental testbed, The 6th IEEE/ACM International Workshop on Grid Computing, 2005., pp.99-106, 2005.
DOI : 10.1109/GRID.2005.1542730

URL : https://hal.archives-ouvertes.fr/hal-00684943

W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks et al., Introduction to UPC and language specification, Center for Computing Sciences, p.46, 1999.

P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes, Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor, IEEE Micro, vol.30, issue.2, pp.16-29, 2010.
DOI : 10.1109/MM.2010.31

Intel Corp., An Introduction to the Intel QuickPath Interconnect, pp.29-30, 2009.

M. Daga, A. Aji, and W. Feng, On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing, 2011 Symposium on Application Accelerators in High-Performance Computing, p.32, 2011.
DOI : 10.1109/SAAHPC.2011.29

R. Dolbeau, S. Bihan, and F. Bodin, HMPP: A hybrid multi-core parallel programming environment, pp.1-5, 2007.

D. Dalessandro, A. Devulapalli, and P. Wyckoff, iWarp protocol kernel space software implementation, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, pp.274-274, 2006.
DOI : 10.1109/IPDPS.2006.1639565

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.3563

E. Demaine, A threads-only MPI implementation for the development of parallel programs, Proceedings of the 11th International Symposium on High Performance Computing Systems, pp.153-163, 1997.

U. Drepper and I. Molnar, The Native POSIX Thread Library for Linux.

V. Danjean, R. Namyst, and P. Wacrenier, An Efficient Multi-level Trace Toolkit for Multi-threaded Applications, EuroPar, p.81, 2005.
DOI : 10.1007/11549468_21

URL : https://hal.archives-ouvertes.fr/hal-00360309

J. Dinan, S. Olivier, G. Sabin, P. Sadayappan, and C. Tseng, Dynamic load balancing of unbalanced computations using message passing, IEEE International Parallel and Distributed Processing Symposium, pp.1-8.

F. Pellegrini, Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs, Proceedings of SHPCC'94, 1994.

M. Farreras, T. Cortes, J. Labarta, and G. Almasi, Scaling MPI to short-memory MPPs such as BG/L, Proceedings of the 20th annual international conference on Supercomputing, ICS '06, pp.209-218, 2006.

M. Frigo, C. E. Leiserson, and K. H. Randall, The implementation of the Cilk-5 multithreaded language, Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pp.212-223, 1998.
DOI : 10.1145/277652.277725

S. Fuller, Motorola's AltiVec technology, Networking & Computing Core Technology, pp.72-82, 1999.

F. Pellegrini, Scotch and LibScotch 5.1 User's Guide, ScAlApplix project, INRIA Bordeaux Sud-Ouest, ENSEIRB & LaBRI, UMR CNRS 5800.
URL : https://hal.archives-ouvertes.fr/hal-00410332

F. García, A. Calderón, and J. Carretero, MiMPI: A Multithread-Safe Implementation of MPI, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.674-674, 1999.
DOI : 10.1007/3-540-48158-3_26

E. Gabriel, G. Fagg, G. Bosilca, T. Angskun, J. Dongarra et al., Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.353-377, 2004.
DOI : 10.1007/978-3-540-30218-6_19

E. Gabriel, G. Fagg, G. Bosilca, T. Angskun, J. Dongarra et al., Open MPI: Goals, concept, and design of a next generation MPI implementation, Proceedings, 11th European PVM/MPI Users' Group Meeting, pp.97-104, 2004.

D. Goodell, W. Gropp, X. Zhao, and R. Thakur, Scalable Memory Use in MPI: A Case Study with MPICH2, Recent Advances in the Message Passing Interface, pp.140-149, 2011.
DOI : 10.1007/978-3-642-24449-0_17

B. Goglin, High-performance message-passing over generic Ethernet hardware with Open-MX, Parallel Computing, vol.37, issue.2, pp.85-100, 2011.
DOI : 10.1016/j.parco.2010.11.001

URL : https://hal.archives-ouvertes.fr/inria-00533058

W. Gropp, MPICH2: A new start for MPI implementations, Recent Advances in Parallel Virtual Machine and Message Passing Interface.

M. Howard and A. Kopser, Design of the Tera MTA integrated circuits, Gallium Arsenide Integrated Circuit (GaAs IC) Symposium, 19th Annual, pp.31-45, 1997.

S. Habata, K. Umezawa, M. Yokokawa, and S. Kitawaki, Hardware system of the Earth Simulator, Parallel Computing, vol.30, issue.12, pp.1287-1313, 2004.
DOI : 10.1016/j.parco.2004.09.004

IBM, Next Generation POSIX Threading, p.37

Intel, Intel Cilk Plus specification, http://software.intel.com/en-us/articles/intel-cilk-plus-specification

G. Jin, J. Mellor-Crummey, L. Adhianto, W. N. Scherer, and C. Yang, Implementation and performance evaluation of the HPC Challenge benchmarks in Coarray Fortran 2.0, Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pp.40-1089, 2011.

Khronos OpenCL Working Group, The OpenCL Specification, version 1.1.

D. Koufaty and D. T. Marr, Hyperthreading technology in the NetBurst microarchitecture, pp.56-65, 2003.

C. Kurmann, F. Rauch, and T. Stricker, Speculative defragmentation leading gigabit ethernet to true zero-copy communication, Cluster Computing, vol.4, issue.1, pp.7-18, 2001.
DOI : 10.1023/A:1011456024871

X. Leroy, The LinuxThreads library, p.36, 1999.

J. Laudon and D. Lenoski, The SGI Origin: a ccNUMA highly scalable server, Proceedings of the 24th annual international symposium on Computer architecture, ISCA '97, pp.241-251, 1997.

M. Litzkow, M. Livny, and M. Mutka, Condor: a hunter of idle workstations, Proceedings of the 8th International Conference on Distributed Computing Systems, p.11, 1988.
DOI : 10.1109/DCS.1988.12507

D. B. Loveman, High performance Fortran, IEEE Parallel & Distributed Technology: Systems & Applications, vol.1, issue.1, pp.25-42, 1993.
DOI : 10.1109/88.219857

G. Mercier and J. Clet-ortega, Towards an Efficient Process Placement Policy for MPI Applications in Multicore Environments, In EuroPVM/MPI Lecture Notes in Computer Science, vol.5759, issue.103, pp.104-115, 2009.
DOI : 10.1007/978-3-642-03770-2_17

URL : https://hal.archives-ouvertes.fr/inria-00392581

G. E. Moore, Progress in digital integrated electronics, Electron Devices Meeting, pp.11-13, 1975.

G. E. Moore, Cramming more components onto integrated circuits, Solid-State Circuits Newsletter, pp.114 ff., 1965.

The Message Passing Interface (MPI) standard, http://www.mcs.anl.gov

Message Passing Interface (MPI) Forum.

C. McNairy and D. Soltis, Itanium 2 processor microarchitecture, IEEE Micro, vol.23, issue.2, pp.44-55, 2003.
DOI : 10.1109/MM.2003.1196114

R. Namyst, PM2 : un environnement pour une conception portable et une exécution efficace des applications parallèles irrégulières, p.37, 1997.

R. W. Numrich and J. Reid, Co-Array Fortran for parallel programming, SIGPLAN Fortran Forum, vol.17, pp.1-31, 1998.

M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski, ScalaTrace: Scalable compression and replay of communication traces for high-performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.8, pp.696-710, 2009.
DOI : 10.1016/j.jpdc.2008.09.001

J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, A survey of general-purpose computation on graphics hardware, Computer Graphics Forum, vol.26, issue.1.

IEEE Standards Office, IEEE Std. 1596-1992, 1993.

I. Pratt and K. Fraser, Arsenic: a user-accessible gigabit Ethernet interface, Proceedings IEEE INFOCOM 2001. Conference on Computer Communications. Twentieth Annual Joint Conference of the IEEE Computer and Communications Society (Cat. No.01CH37213), pp.67-76, 2001.
DOI : 10.1109/INFCOM.2001.916688

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.4189

F. Petrini, W. Feng, A. Hoisie, S. Coll, and E. Frachtenberg, The Quadrics network: high-performance clustering technology.

OpenMPspy: Leveraging quality assurance for parallel software, Proceedings of the 17th international conference on Parallel processing, Volume Part II, Euro-Par'11, pp.124-135, 2011.

S. Pakin, M. Lauria, and A. Chien, High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet, Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, 1995. Available from http://www.c3.lanl.gov/PAL/publications/papers/Pakin1995:FM.pdf

Portable Linux Processor Affinity (PLPA), http://www.open-mpi.org/projects/plpa

L. Prylli and B. Tourancheau, BIP: a new protocol designed for high performance networking on Myrinet, Workshop PC-NOW, IPPS/SPDP98.

A. Peleg and U. Weiser, MMX technology extension to the Intel architecture.

R. Preissl et al., Multithreaded global address space communication techniques for gyrokinetic fusion applications on ultra-scale platforms, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.

J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism, O'Reilly, pp.40-54, 2007.

RENATER, Réseau national de télécommunications pour la technologie, l'enseignement et la recherche.

K. Ravichandran, S. Lee, and S. Pande, Work Stealing for Multi-core HPC Clusters, Euro-Par 2011 Parallel Processing, pp.205-217, 2011.
DOI : 10.1145/568014.379563

S. K. Raman, V. Pentkovski, and J. Keshava, Implementing streaming SIMD extensions on the Pentium III processor, IEEE Micro, vol.20, issue.4, pp.47-57, 2000.
DOI : 10.1109/40.865866

R. M. Russell, The CRAY-1 computer system, Communications of the ACM, vol.21, issue.1, pp.63-72, 1978.
DOI : 10.1145/359327.359336

V. S. Sunderam, G. A. Geist, J. Dongarra, and R. Manchek, The PVM concurrent computing system: Evolution, experiences, and trends, Parallel Computing, vol.20, issue.4, pp.531-546, 1994.
DOI : 10.1016/0167-8191(94)90027-2

A. Sodan, Message-passing and shared-data programming models: wish vs. reality, High Performance Computing Systems and Applications, Annual International Symposium on, 2005.

T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband et al., BEOWULF: A parallel workstation for scientific computation, Proceedings of the 24th International Conference on Parallel Processing, 1995.

P. Shivam, P. Wyckoff, and D. Panda, EMP, Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM) , Supercomputing '01, pp.49-66, 2001.
DOI : 10.1145/582034.582091

D. M. Tullsen, S. J. Eggers, and H. M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp.392-403, 1995.

R. Thakur and W. Gropp, Test suite for evaluating performance of multithreaded MPI communication, Parallel Computing, vol.35, issue.12, pp.608-617, 2009.
DOI : 10.1016/j.parco.2008.12.013

J. E. Thornton, The CDC 6600 Project, IEEE Annals of the History of Computing, vol.2, issue.4, pp.338-348, 1980.
DOI : 10.1109/MAHC.1980.10044

H. Tamura, S. Kamiya, and T. Ishigai, FACOM VP-100 Supercomputers with ease of use, pp.87-107, 1985.

Top500 Supercomputing Sites.

F. Trahay, F. Rue, M. Faverge, Y. Ishikawa, R. Namyst et al., EZTrace: a generic framework for performance analysis, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Poster Session, p.81.

H. Tang and T. Yang, Optimizing threaded MPI execution on SMP clusters, Proceedings of the 15th ACM International Conference on Supercomputing, p.381, 2001.

N. Uchida, M. Hirai, M. Yoshida, and K. Hotta, Fujitsu VP2000 series, Compcon Spring '90. Intellectual Leverage. Digest of Papers. Thirty-Fifth IEEE Computer Society International Conference, pp.4-11, 1990.
DOI : 10.1109/cmpcon.1990.63645

V. Volkov and J. W. Demmel, Benchmarking GPUs to tune dense linear algebra, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.

T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: a Mechanism for Integrated Communication and Computation, Proceedings of the 19th Int'l Symp. on Computer Architecture, 1992.

T. Watari and H. Murano, Packaging technology for the NEC SX supercomputer, IEEE Transactions on Components, Hybrids, and Manufacturing Technology, vol.8, issue.4, pp.462-467, 1985.

XcalableMP: Directive-based language extension for scalable and performance-aware parallel programming.

Pingpong-thread_single-bertha, Impact of the placement of 2 processes with 4 threads on the machine, p.104

Pingpong-thread_multiple-bertha, Impact of the placement of 2 processes with 4 threads on the machine, p.105

Pingpong-thread_single-kwak, Impact of the placement of 2 processes with 4 threads on the machine, p.106

Pingpong-thread_multiple-kwak, Impact of the placement of 2 processes with 4 threads on the machine, p.107

Pingpong-thread_multiple-fourmi, Impact of the placement of 2 processes with 4 threads on the compute cluster