Figure A.3: Actual implementation of the tile Cholesky hybrid algorithm with StarPU

M. Amini, C. Ancourt, F. Coelho, F. Irigoin, P. Jouvelot et al., PIPS Is not (just) Polyhedral Software, International Workshop on Polyhedral Compilation Techniques (IMPACT'11), 2011.
URL : https://hal.archives-ouvertes.fr/hal-00744312

M. Amini, F. Coelho, F. Irigoin, and R. Keryell, Static compilation analysis for host-accelerator communication optimization, 24th Int. Workshop on Languages and Compilers for Parallel Computing (LCPC), Fort Collins, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00743496

E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney et al., LAPACK: a portable linear algebra library for high-performance computers, Proceedings of the 1990 ACM/IEEE conference on Supercomputing, Supercomputing '90, pp.2-11, 1990.

D. Andrade, B. B. Fraguela, J. Brodman, and D. Padua, Task-parallel versus data-parallel library-based programming in multicore systems, Proceedings of the 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp.101-110, 2009.

J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao et al., PetaBricks: a language and compiler for algorithmic choice, Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI '09, pp.38-49, 2009.

I. Apple, Apple Technical Brief on Grand Central Dispatch, 2009.

I. Apple, Introducing Blocks and Grand Central Dispatch, 2010.

K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands et al., The landscape of parallel computing research: A view from berkeley, 2006.

E. Ayguade, R. M. Badia, D. Cabrera, A. Duran, M. Gonzalez et al., A proposal to extend the openmp tasking model for heterogeneous architectures, IWOMP '09: Proceedings of the 5th International Workshop on OpenMP, pp.154-167, 2009.

E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Proceedings of the 15th Euro-Par Conference, 2009.

R. M. Badia, J. Labarta, R. Sirvent, J. M. Pérez, J. M. Cela et al., Programming grid applications with grid superscalar, Journal of Grid Computing, vol.1, issue.10, pp.151-170, 2003.
DOI : 10.1023/b:grid.0000024072.93701.f3

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96.9646

J. Balart, A. Duran, M. González, X. Martorell, E. Ayguadé et al., Nanos Mercurium: a research compiler for OpenMP, European Workshop on OpenMP, pp.103-109, 2004.

Barcelona Supercomputing Center, SMP Superscalar (SMPSs) User's Manual, Version 2.0, 2008.

M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev et al., A compiler framework for optimization of affine loop nests for GPGPUs, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pp.225-234, 2008.

M. M. Baskaran, J. Ramanujam, and P. Sadayappan, Automatic C-to-CUDA code generation for affine programs, CC'10, pp.244-263, 2010.

M. Bauer, J. Clark, E. Schkufza, and A. Aiken, Programming the memory hierarchy revisited: supporting irregular parallelism in sequoia, Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, pp.13-24, 2011.

P. Bellens, J. M. Pérez, F. Cabarcas, A. Ramírez, R. M. Badia et al., Cellss: Scheduling techniques to better exploit memory hierarchy, Scientific Programming, pp.77-95, 2009.
DOI : 10.1155/2009/561672

URL : http://doi.org/10.1155/2009/561672

L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel et al., ScaLAPACK user's guide, 1997.
DOI : 10.1137/1.9780898719642

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier et al., DAGuE: A generic distributed DAG engine for high performance computing, Sep. 2010.

B. Bouzas, R. Cooper, J. Greene, M. Pepe, and M. J. Prelle, MultiCore Framework: An API for Programming Heterogeneous Multicore Processors, Proc. of First Workshop on Software Tools for Multi-Core Systems, 2006.

T. Brandes, Exploiting advanced task parallelism in high performance fortran via a task library, Euro-Par99 Parallel Processing, pp.833-844, 1999.

F. Broquedis, O. Aumage, B. Goglin, S. Thibault, P. Wacrenier et al., Structuring the execution of OpenMP applications for multicore architectures, Proceedings of 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS'10), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00441472

F. Broquedis, J. Clet-ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications, Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), pp.180-186, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00429889

F. Broquedis, N. Furmento, B. Goglin, P. Wacrenier, and R. Namyst, ForestGOMP: an efficient OpenMP environment for NUMA architectures, International Journal on Parallel Programming, Special Issue on OpenMP, vol.38, issue.5, pp.418-439, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00496295

E. Brunet, F. Trahay, A. Denis, and R. Namyst, A sampling-based approach for communication libraries auto-tuning, IEEE International Conference on Cluster Computing, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00605735

I. Buck, T. Foley, D. Reiter-horn, J. Sugerman, K. Fatahalian et al., Brook for GPUs: stream computing on graphics hardware, ACM Trans. Graph, vol.23, issue.3, pp.777-786, 2004.

J. Bueno, A. Duran, X. Martorell, E. Ayguadé, R. M. Badia et al., Poster: programming clusters of gpus with ompss, Proceedings of the international conference on Supercomputing, pp.378-378, 2011.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, 2007.

D. Campbell, Vsipl++ acceleration using commodity graphics processors, Proceedings of the HPCMP Users Group Conference, pp.315-320, 2006.
DOI : 10.1109/hpcmp-ugc.2006.77

L.-C. Canon and E. Jeannot, Evaluation and optimization of the robustness of DAG schedules in heterogeneous environments, IEEE Transactions on Parallel and Distributed Systems, vol.99, issue.RapidPosts, pp.532-546, 2009.

P. Carpenter, Running Stream-like Programs on Heterogeneous Multi-core Systems, 2011.

S. Carr and K. Kennedy, Blocking linear algebra codes for memory hierarchies, Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, 1989.

E. Chan, Runtime data flow scheduling of matrix computations, 2009.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, IEEE Workload Characterization Symposium, vol.0, pp.44-54, 2009.

L. Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao, Dynamic load balancing on single- and multi-GPU systems, Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp.1-12, 2010.
DOI : 10.1109/ipdps.2010.5470413

URL : http://cacs.usc.edu/education/cs653/Chen-LoadBalanceGPU-IPDPS10.pdf

S. Collange, D. Defour, and A. Tisserand, Power Consumption of GPUs from a Software Perspective, 9th International Conference on Computational Science, pp.914-923, 2009.

Intel Corp., A Quick, Easy and Reliable Way to Improve Threaded Performance: Intel Cilk Plus, 2010.

M. Cosnard and M. Loi, Automatic task graph generation techniques, Hawaii International Conference on System Sciences, p.113, 1995.
DOI : 10.1109/hicss.1995.375471

M. Cosnard, E. Jeannot, and T. Yang, Slc: Symbolic scheduling for executing parameterized task graphs on multiprocessors, Proceedings of the 1999 International Conference on Parallel Processing, ICPP '99, p.413, 1999.
URL : https://hal.archives-ouvertes.fr/inria-00098842

K. Coulomb, M. Faverge, J. Jazeix, O. Lagrasse, J. Marcoueille, A. Redondy, C. Vuchener et al., ViTE's project page, 2009.

C. H. Crawford, P. Henning, M. Kistler, and C. Wright, Accelerating computing with the cell broadband engine processor, CF '08, pp.3-12, 2008.

V. Danjean, R. Namyst, and P. Wacrenier, An efficient multi-level trace toolkit for multi-threaded applications, EuroPar, 2005.
URL : https://hal.archives-ouvertes.fr/hal-00360309

U. Dastgeer, J. Enmyren, and C. W. Kessler, Auto-tuning skepu: a multibackend skeleton programming framework for multi-gpu systems, Proceeding of the 4th international workshop on Multicore software engineering, pp.25-32, 2011.

J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, Communication-avoiding parallel and sequential QR factorizations, 2008.
DOI : 10.1137/080731992

URL : http://arxiv.org/abs/0808.2664

G. F. Diamos and S. Yalamanchili, Harmony: an execution model and runtime for heterogeneous many core systems, HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing, pp.197-200, 2008.

R. Dolbeau, S. Bihan, and F. Bodin, HMPP: A hybrid multi-core parallel programming environment, 2007.

D. Dunning, G. Regnier, G. Mcalpine, D. Cameron, B. Shubert et al., The virtual interface architecture, pp.66-76, 1998.

A. Duran, R. Ferrer, M. Klemm, B. R. de Supinski, and E. Ayguadé, A proposal for user-defined reductions in OpenMP, Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, pp.43-55, 2010.

A. E. Eichenberger, K. O'Brien, K. O'Brien, P. Wu, T. Chen et al., Optimizing compiler for the Cell processor, Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT '05, pp.161-172, 2005.

H. El-rewini, T. G. Lewis, and H. H. Ali, Task scheduling in parallel and distributed systems, 1994.

J. Ellson, E. Gansner, L. Koutsofios, S. North, and G. Woodhull, Graphviz: open source graph drawing tools, Lecture Notes in Computer Science, pp.483-484, 2001.

J. Enmyren and C. W. Kessler, Skepu: a multi-backend skeleton programming library for multi-gpu systems, Proceedings of the fourth international workshop on High-level parallel programming and applications, pp.5-14, 2010.

N. Farooqui, A. Kerr, G. Diamos, S. Yalamanchili, and K. Schwan, A framework for dynamically instrumenting gpu compute applications within gpu ocelot, Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pp.1-9, 2011.

J. Farrugia, P. Horain, E. Guehenneux, and Y. Alusse, Gpucv: A framework for image processing acceleration with graphics processors, Multimedia and Expo IEEE International Conference on, pp.585-588, 2006.

K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. Reiter-horn et al., Sequoia: Programming the memory hierarchy, Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.

M. Fatica, Accelerating linpack with cuda on heterogenous clusters, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp.46-51, 2009.

P. Ferraro, P. Hanna, L. Imbert, and T. Izard, Accelerating query-by-humming on GPU, Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR'09), pp.279-284, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00407932

Intel, Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency. http://software.intel.com/en-us/articles/intel-avx-newfrontiers-in-performance-improvements-and-energyefficiency

High Performance Fortran Forum, High Performance Fortran language specification, 1993.

M. Frigo and S. G. Johnson, FFTW: An adaptive software architecture for the FFT, Proc. 1998 IEEE Intl. Conf. Acoustics Speech and Signal Processing, pp.1381-1384, 1998.

M. Frigo, C. E. Leiserson, and K. H. Randall, The implementation of the cilk-5 multithreaded language, SIGPLAN Not, vol.33, issue.5, pp.212-223, 1998.

F. Galilee, G. G. Cavalheiro, J. Roch, and M. Doreille, Athapascan-1: On-line building data flow graph in a parallel language, Parallel Architectures and Compilation Techniques Proceedings. 1998 International Conference on, pp.88-95, 1998.

A. Geist, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk et al., MPI-2: Extending the message-passing interface, Euro-Par'96 Parallel Processing, pp.128-135, 1996.

I. Gelado, J. Cabezas, J. E. Stone, S. Patel, N. Navarro et al., An asymmetric distributed shared memory model for heterogeneous parallel systems, International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'10), 2010.

I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro et al., An asymmetric distributed shared memory model for heterogeneous parallel systems, SIGARCH Comput. Archit. News, vol.38, pp.347-358, 2010.

L. Genovese, M. Ospici, T. Deutsch, J. Méhaut, A. Neelov et al., Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures, J Chem Phys, vol.131, issue.3, p.34103, 2009.

S. Ghiasi, T. Keller, and F. Rawson, Scheduling for heterogeneous processors in server systems, Proceedings of the 2nd conference on Computing frontiers, CF '05, pp.199-210, 2005.

GNU Project, Plugins - GNU Compiler Collection (GCC) Internals, 2011.

H. González-vélez and M. Leyton, A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers, Softw. Pract. Exper, vol.40, pp.1135-1160, 2010.

J. Greene, C. Nowacki, and M. Prelle, Pas: A parallel applications system for signal processing applications, International Conference on Signal Processing Applications and Technology, 1996.

C. Gregg, J. Brantley, and K. Hazelwood, Contention-aware scheduling of parallel code for heterogeneous systems, 2nd USENIX Workshop on Hot Topics in Parallelism, 2010.

C. Grelck, S. Scholz, and A. Shafarenko, A Gentle Introduction to S-Net: Typed Stream Processing and Declarative Coordination of Asynchronous Components, Parallel Processing Letters, vol.18, issue.2, pp.221-237, 2008.

D. Grewe and M. F. O'Boyle, A static task partitioning approach for heterogeneous systems using OpenCL, Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software, pp.286-305, 2011.

The Khronos Group, OpenCL - the open standard for parallel programming of heterogeneous systems

L. Gu, J. Siegel, and X. Li, Using gpus to compute large out-of-card ffts, Proceedings of the international conference on Supercomputing, pp.255-264, 2011.

J. Hammersley and D. C. Handscomb, Monte Carlo methods, 1964.

T. D. Han and T. S. Abdelrahman, hiCUDA: a high-level directive-based language for GPU programming, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp.52-61, 2009.

B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, Mars: a MapReduce framework on graphics processors, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pp.260-269, 2008.

S. Henry, OpenCL as StarPU frontend

E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard, Multi-GPU and multi-CPU parallelization for interactive physics simulations, Euro-Par 2010 - Parallel Processing, pp.235-246, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00502448

C. A. R. Hoare, Communicating sequential processes, Commun. ACM, vol.21, pp.666-677, 1978.

T. Hoefler, P. Kambadur, R. Graham, G. Shipman, and A. Lumsdaine, A Case for Standard Non-blocking Collective Operations, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.125-134, 2007.

S. Hong and H. Kim, An integrated gpu power and performance model, SIGARCH Comput. Archit. News, vol.38, pp.280-289, 2010.

C. Huang, O. Lawlor, and L. Kalé, Adaptive MPI, Languages and Compilers for Parallel Computing, pp.306-322, 2004.

J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and E. J. Kelmelis, CULA: hybrid GPU accelerated linear algebra routines, Modeling and Simulation for Defense Systems and Applications V, 2010.
DOI : 10.1117/12.850538

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.456.478

Intel Corp., Intel Ct Technology. http://software.intel.com/en-us/data-parallel/

Intel Corp., Intel Threading Building Blocks.

T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard et al., Automatic CPU-GPU communication management and optimization, Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pp.142-151, 2011.
DOI : 10.1145/2345156.1993516

V. J. Jiménez, L. Vilanova, I. Gelado, M. Gil, G. Fursin et al., Predictive Runtime Code Scheduling for Heterogeneous Architectures, HiPEAC, pp.19-33, 2009.

L. V. Kalé, B. Ramkumar, A. B. Sinha, and V. A. Saletore, The CHARM Parallel Programming Language and System: Part II - The Runtime system, 1994.

L. V. Kalé and S. Krishnan, Charm++: a portable concurrent object oriented system based on C++, Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications, OOPSLA '93, pp.91-108, 1993.

L. V. Kalé, D. M. Kunzman, and L. Wesolowski, Accelerator Support in the Charm++ Parallel Programming Model, Scientific Computing with Multicore and Accelerators, pp.393-412, 2011.

J. Chassin de Kergommeaux and B. de Oliveira Stein, Pajé: An Extensible Environment for Visualizing Multi-threaded Programs Executions, Proc. Euro-Par, pp.133-144, 2000.
DOI : 10.1007/3-540-44520-X_17

J. Kim, H. Kim, J. H. Lee, and J. Lee, Achieving a single compute device image in OpenCL for multiple GPUs, Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, pp.277-288, 2011.

V. V. Kindratenko and R. J. Brunner, Accelerating cosmological data analysis with FPGAs, Programmable Custom Computing Machines, Annual IEEE Symposium on, pp.11-18, 2009.

A. Klöckner, N. Pinto, Y. Lee, B. C. Catanzaro, P. Ivanov et al., PyCUDA: GPU run-time code generation for high-performance computing, 2009.

K. Komatsu, K. Sato, Y. Arai, K. Koyama, H. Takizawa et al., Evaluating performance and portability of opencl programs, The Fifth International Workshop on Automatic Performance Tuning, 2010.

M. Kudlur and S. Mahlke, Orchestrating the execution of stream programs on multicore platforms, ACM SIGPLAN Notices, vol.43, issue.6, pp.114-124, 2008.
DOI : 10.1145/1379022.1375596

D. Kunzman, Charm++ on the Cell Processor, Master's thesis, 2006.

J. Kurzak, A. Buttari, and J. Dongarra, Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization, IEEE Transactions on Parallel and Distributed Systems, vol.19, issue.9, pp.1175-1186, 2008.
DOI : 10.1109/TPDS.2007.70813

J. Kurzak and J. Dongarra, Implementation of the mixed-precision high performance LINPACK Benchmark on the CELL Processor, 2006.

J. Kurzak and J. Dongarra, Implementing Linear Algebra Routines on Multi-core Processors with Pipelining and a Look Ahead, Applied Parallel Computing. State of the Art in Scientific Computing, pp.147-156, 2007.
DOI : 10.1007/978-3-540-75755-9_18

J. Kurzak, R. Nath, P. Du, and J. Dongarra, An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs, 2010.
DOI : 10.1016/S0167-8191(00)00087-9

L. Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs, IEEE Transactions on Computers, vol.28, issue.9, pp.690-691, 1979.
DOI : 10.1109/TC.1979.1675439

O. S. Lawlor, Message passing for GPGPU clusters: CudaMPI, Cluster Computing and Workshops, CLUSTER '09. IEEE International Conference on, pp.1-8, 2009.
DOI : 10.1109/clustr.2009.5289129

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.539.3749

F. Lecron, S. Ahmed-mahmoudi, M. Benjelloun, S. Mahmoudi, and P. Manneback, Heterogeneous Computing for Vertebra Detection and Segmentation in X-Ray Images, International Journal of Biomedical Imaging, vol.10, issue.2, 2011.
DOI : 10.1007/s11548-008-0149-1

J. Lee, S. Seo, C. Kim, J. Kim, P. Chun et al., COMIC, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pp.303-314, 2008.
DOI : 10.1145/1454115.1454157

A. E. Lefohn, S. Sengupta, J. Kniss, R. Strzodka, and J. D. Owens, Glift, ACM Transactions on Graphics, vol.25, issue.1, pp.60-99, 2006.
DOI : 10.1145/1122501.1122505

D. Leijen, W. Schulte, and S. Burckhardt, The design of a task parallel library, ACM SIGPLAN Notices, vol.44, issue.10, pp.227-242, 2009.
DOI : 10.1145/1639949.1640106

Y. Li, J. Dongarra, and S. Tomov, A Note on Auto-tuning GEMM for GPUs, Proceeding of ICCS'09, 2009.
DOI : 10.1007/978-3-642-01970-8_89

M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng, Merge, ACM SIGPLAN Notices, vol.43, issue.3, pp.287-296, 2008.
DOI : 10.1145/1353536.1346318

M. D. Linderman, A programming model and processor architecture for heterogeneous multicore computers, p.3351459, 2009.

H. Ltaief, S. Tomov, R. Nath, P. Du, and J. Dongarra, A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators, Proceedings of the 9th international conference on High performance computing for computational science, VECPAR'10, pp.93-101, 2011.
DOI : 10.1007/978-3-642-03869-3_79

C. Luk, S. Hong, and H. Kim, Qilin, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Micro-42, pp.45-55, 2009.
DOI : 10.1145/1669112.1669121

S. A. Mahmoudi, F. Lecron, P. Manneback, M. Benjelloun, and S. Mahmoudi, GPU-Based Segmentation of Cervical Vertebra in X-Ray Images, IEEE International Conference on Cluster Computing, Crete, Greece, 2010.

W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard, Cg: a system for programming graphics hardware in a C-like language, SIGGRAPH '03: ACM SIGGRAPH 2003 Papers, pp.896-907, 2003.

T. G. Mattson, R. Van-der-wijngaart, and M. Frumkin, Programming the Intel 80-core network-on-a-chip Terascale Processor, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-38, 2008.
DOI : 10.1109/SC.2008.5213921

M. D. McCool, Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform, 2006.

P. McCormick, J. Inman, J. Ahrens, J. Mohd-Yusof, G. Roth et al., Scout: a data-parallel programming language for graphics processors, Parallel Computing, vol.33, issue.10-11, pp.648-662, 2007.
DOI : 10.1016/j.parco.2007.09.001

S. Moreaud, B. Goglin, and R. Namyst, Adaptive MPI Multirail Tuning for Non-uniform Input/Output Access, Lecture Notes in Computer Science, vol.6305, pp.239-248, 2010.
DOI : 10.1007/978-3-642-15646-5_25

URL : https://hal.archives-ouvertes.fr/inria-00486178

Movidius Corp., Movidius - The Mobile Video Processor Company

R. Namyst and J. Méhaut, Marcel: Une bibliothèque de processus légers, 1995.

R. K. Nath, Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach, 2010.

M. Nijhuis, H. Bos, and H. E. Bal, A Component-based Coordination Language for Efficient Reconfigurable Streaming Applications, 2007 International Conference on Parallel Processing (ICPP 2007), p.60, 2007.
DOI : 10.1109/ICPP.2007.5

NVIDIA Corp., Fermi Compute Architecture White Paper

NVIDIA Corp., NVIDIA NPP library

Y. Ogata, T. Endo, N. Maruyama, and S. Matsuoka, An efficient, model-based CPU-GPU heterogeneous FFT library, Parallel and Distributed Processing, pp.1-10, 2008.

J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger et al., A Survey of General-Purpose Computation on Graphics Hardware, Computer Graphics Forum, vol.7, issue.4, pp.80-113, 2007.
DOI : 10.1016/j.rti.2005.04.002

H. Pan, B. Hindman, and K. Asanović, Composing parallel software efficiently with Lithe, Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, PLDI '10, pp.376-387, 2010.
DOI : 10.1145/1806596.1806639

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.172.2385

A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong et al., FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs, 2009 IEEE 7th Symposium on Application Specific Processors, pp.35-42, 2009.
DOI : 10.1109/SASP.2009.5226333

F. Pellegrini and J. Roman, Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs, High-Performance Computing and Networking, pp.493-498, 1996.
DOI : 10.1007/3-540-61142-8_588

A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, HPL -A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers -Version 2, 2008.

D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee et al., The design and implementation of a first-generation cell processor -a multi-core soc, Integrated Circuit Design and Technology, 2005. ICICDT 2005. 2005 International Conference on, pp.49-52, 2005.

J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta, Hierarchical Task-Based Programming With StarSs, International Journal of High Performance Computing Applications, vol.23, issue.3, p.284, 2009.
DOI : 10.1177/1094342009106195

A. Pop and A. Cohen, A stream-computing extension to OpenMP, Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, pp.5-14, 2011.
DOI : 10.1145/1944862.1944867

URL : https://hal.archives-ouvertes.fr/hal-00659411

F. Puglisi, R. Ridi, F. Cecchi, A. Bonelli, and R. Ferrari, Segmental vertebral motion in the assessment of neck range of motion in whiplash patients, International Journal of Legal Medicine, vol.118, issue.4, pp.235-244, 2004.
DOI : 10.1007/s00414-004-0462-3

M. Puschel, J. M. Moura, J. R. Johnson, D. Padua, M. M. Veloso et al., SPIRAL: Code Generation for DSP Transforms, Proceedings of the IEEE, pp.232-275, 2005.
DOI : 10.1109/JPROC.2004.840306

B. Putigny, Optimisation de code sur processeur Cell, Master's thesis, Laboratoire PRISM, 2009.

R. Rabenseifner, G. Hager, and G. Jost, Hybrid mpi/openmp parallel programming on clusters of multi-core smp nodes. Parallel, Distributed, and Network-Based Processing, Euromicro Conference on, vol.0, pp.427-436, 2009.

V. T. Ravi, W. Ma, D. Chiu, and G. Agrawal, Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations, Proceedings of the 24th ACM International Conference on Supercomputing, ICS '10, pp.137-146, 2010.

M. C. Rinard, D. J. Scales, and M. S. Lam, Jade: A high-level, machine-independent language for parallel programming, Computer, vol.26, pp.28-38, 1993.

D. W. Roeh, V. V. Kindratenko, and R. J. Brunner, Accelerating cosmological data analysis with graphics processors, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2, pp.1-8, 2009.
DOI : 10.1145/1513895.1513896

S. Rosario-torres and M. Velez-reyes, Speeding up the matlab hyperspectral image analysis toolbox using gpus and the jacket toolbox, Hyperspectral Image and Signal Processing: Evolution in Remote Sensing WHISPERS '09. First Workshop on, pp.1-4, 2009.

R. J. Rost, OpenGL(R) Shading Language, 2005.

L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash et al., Larrabee: a many-core x86 architecture for visual computing, ACM Trans. Graph, vol.27, pp.18:1-18:15, 2008.

K. Shirahata, H. Sato, and S. Matsuoka, Hybrid map task scheduling for GPU-based heterogeneous clusters, Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science, CLOUDCOM '10, pp.733-740, 2010.
DOI : 10.1109/cloudcom.2010.55

F. Smailbegovic, G. N. Gaydadjiev, and S. Vassiliadis, Sparse matrix storage format

F. Song, A. Yarkhan, and J. Dongarra, Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pp.1-19, 2009.
DOI : 10.1145/1654059.1654079

K. Spafford, J. S. Meredith, and J. S. Vetter, Quantifying NUMA and contention effects in multi-GPU systems, Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pp.1-11, 2011.
DOI : 10.1145/1964179.1964194

J. Stratton, S. Stone, and W. Hwu, MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs, Languages and Compilers for Parallel Computing, pp.16-30, 2008.
DOI : 10.1007/978-3-540-89740-8_2

T. Suganuma, H. Komatsu, and T. Nakatani, Detection and global optimization of reduction operations for distributed parallel machines, Proceedings of the 10th international conference on Supercomputing, ICS '96, pp.18-25, 1996.
DOI : 10.1145/237578.237581

Mellanox Technologies, GPUDirect Technology Accelerating GPU-based Systems, the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pp.13-24, 2010.

The Portland Group, PGI Fortran & C Accelerator Programming Model white paper

The Portland Group, PGI CUDA-x86, 2011.

S. Thibault, R. Namyst, and P. Wacrenier, Building Portable Thread Schedulers for Hierarchical Multiprocessors: The BubbleSched Framework, Proceedings of the 13th International Euro-par Conference, 2007.
DOI : 10.1007/978-3-540-74466-5_6

URL : https://hal.archives-ouvertes.fr/inria-00154506

S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, Dense linear algebra solvers for multicore with GPU accelerators, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), pp.1-8, 2010.
DOI : 10.1109/IPDPSW.2010.5470941

H. Topcuoglu, S. Hariri, and M. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.260-274, 2002.

F. Trahay and A. Denis, A scalable and generic task scheduling system for communication libraries, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289169

URL : https://hal.archives-ouvertes.fr/inria-00408521

G. Tzenakis, K. Kapelonis, M. Alvanos, K. Koukos, D. S. Nikolopoulos et al., Tagged Procedure Calls (TPC): Efficient Runtime Support for Task-Based Parallelism on the Cell Processor, HiPEAC, pp.307-321, 2010.
DOI : 10.1007/978-3-642-11515-8_23

S. Tzeng, A. Patney, and J. D. Owens, Poster: Task management for irregular workloads on the gpu, Proceeding of NVIDIA GPU Technology Conference, 2010.

J. D. Valois, Implementing lock-free queues, Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, pp.64-69, 1994.

R. F. Van-der-wijngaart, T. G. Mattson, and W. Haas, Light-weight communications on Intel's single-chip cloud computer processor, ACM SIGOPS Operating Systems Review, vol.45, issue.1, pp.73-83, 2011.
DOI : 10.1145/1945023.1945033

S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson et al., An 80-tile sub-100-W teraflops processor in 65-nm CMOS, IEEE Journal of Solid-State Circuits, vol.43, issue.1, pp.29-41, 2008.

J. S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis et al., Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community, Computing in Science & Engineering, vol.13, issue.5, pp.90-95, 2011.
DOI : 10.1109/MCSE.2011.83

V. Volkov and J. Demmel, Benchmarking GPUs to tune dense linear algebra, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
DOI : 10.1109/SC.2008.5214359

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.218.3436

R. Vuduc, J. Demmel, and K. Yelick, OSKI: A library of automatically tuned sparse matrix kernels, Proc. of SciDAC'05, 2005.
DOI : 10.1088/1742-6596/16/1/071

L. Wesolowski, An Application Programming Interface for General Purpose Graphics Processing Units in an Asynchronous Runtime System, Master's thesis, 2008.

R. C. Whaley, A. Petitet, and J. Dongarra, Automated empirical optimizations of software and the ATLAS project, Parallel Computing, vol.27, issue.12, pp.3-35, 2001.

S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick, Lattice Boltzmann simulation optimization on leading multicore platforms, 2008 IEEE International Symposium on Parallel and Distributed Processing, 2008.
DOI : 10.1109/IPDPS.2008.4536295

C. M. Wittenbrink, E. Kilgariff, and A. Prabhu, Fermi GF100 GPU Architecture, IEEE Micro, vol.31, issue.2, pp.50-59, 2011.
DOI : 10.1109/MM.2011.24

L. Wu, C. Weaver, and T. Austin, CryptoManiac: a fast flexible architecture for secure communication, Proceedings of the 28th Annual International Symposium on Computer Architecture, pp.110-119, 2001.

A. Yarkhan, J. Kurzak, and J. Dongarra, Quark users' guide: Queueing and runtime for kernels

B. Zhang, S. Xu, F. Zhang, Y. Bi, and L. Huang, Accelerating MATLAB code using GPU: A review of tools and strategies, 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), pp.1875-1878, 2011.

Appendix D. Publications

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators, Symposium on Application Accelerators in High Performance Computing (SAAHPC), 2010.

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs, GPU Computing Gems, 2010.
DOI : 10.1016/B978-0-12-385963-1.00034-4

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov, LU factorization for accelerator-based systems, 9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11), 2011.

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, 2011.
DOI : 10.1109/IPDPS.2011.90
URL : https://hal.archives-ouvertes.fr/inria-00547614

C. Augonnet, J. Clet-ortega, S. Thibault, and R. Namyst, Data-Aware Task Scheduling on Multi-accelerator Based Platforms, 2010 IEEE 16th International Conference on Parallel and Distributed Systems, 2010.
DOI : 10.1109/ICPADS.2010.129
URL : https://hal.archives-ouvertes.fr/inria-00523937

C. Augonnet and R. Namyst, A Unified Runtime System for Heterogeneous Multi-core Architectures, Proceedings of the International Euro-Par Workshops, pp.174-183, 2008.

C. Augonnet, S. Thibault, and R. Namyst, Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures, Proceedings of the International Euro-Par Workshops, pp.56-65, 2009.
DOI : 10.1007/978-3-642-14122-5_9
URL : https://hal.archives-ouvertes.fr/inria-00421333

C. Augonnet, S. Thibault, and R. Namyst, StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00467677

C. Augonnet, S. Thibault, R. Namyst, and M. Nijhuis, Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System, SAMOS Workshop - International Workshop on Systems, Architectures, Modeling, and Simulation, 2009.
DOI : 10.1007/978-3-642-03138-0_36
URL : https://hal.archives-ouvertes.fr/inria-00378705

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Proceedings of the 15th International Euro-Par Conference, pp.863-874, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, pp.187-198, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

C. Augonnet, Vers des supports d'exécution capables d'exploiter les machines multicoeurs hétérogènes, 2008.

C. Augonnet, StarPU: un support exécutif unifié pour les architectures multicoeurs hétérogènes, 19èmes Rencontres Francophones du Parallélisme, 2009.

S. Benkner, S. Pllana, J. L. Träff, P. Tsigas, U. Dolinsky et al., PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems, IEEE Micro, vol.31, issue.5, pp.3128-3169, 2011.
DOI : 10.1109/MM.2011.67
URL : https://hal.archives-ouvertes.fr/hal-00648480

M. Nijhuis, H. Bos, H. E. Bal, and C. Augonnet, Mapping and Synchronizing Streaming Applications on Cell Processors, HiPEAC, pp.216-230, 2009.
DOI : 10.1007/978-3-540-92990-1_17
URL : https://hal.archives-ouvertes.fr/inria-00445993

INRIA RUNTIME Team Website, StarPU: A unified runtime system for heterogeneous multicore architectures