M. K. Aguilera, W. Chen, and S. Toueg, Heartbeat: A timeout-free failure detector for quiescent reliable communication, in Marios Mavronicolas and Philippas Tsigas (eds.), Proceedings of the 11th Workshop on Distributed Algorithms (WDAG'97), pp.126-140, 1997.
DOI : 10.1007/BFb0030680

R. Alfieri, R. Cecchini, V. Ciaschini, L. Dell'Agnello, Á. Frohner et al., VOMS, an Authorization System for Virtual Organizations, European Across Grids Conference, pp.33-40
DOI : 10.1007/978-3-540-24689-3_5

D. P. Anderson, BOINC: A System for Public-Resource Computing and Storage, Proceedings of the 5th International Workshop on Grid Computing, pp.4-10, 2004.

T. Angskun, G. Bosilca, and J. Dongarra, Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology, Proceedings of the 5th International Symposium on Parallel and Distributed Processing and Applications, pp.471-482, 2007.
DOI : 10.1007/978-3-540-74742-0_43

T. Angskun, G. Fagg, G. Bosilca, J. Pjesivac-Grbovic, and J. Dongarra, Scalable fault tolerant protocol for parallel runtime environments, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 13th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'06), pp.141-149, 2006.

D. C. Arnold and B. P. Miller, A Scalable Failure Recovery Model for Tree-based Overlay Networks

D. C. Arnold, G. D. Pack, and B. P. Miller, Tree-based overlay networks for scalable applications, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006.
DOI : 10.1109/IPDPS.2006.1639493

E. I. Atanassov, T. V. Gurov, A. Karaivanova, and M. Nedjalkov, Monte Carlo Grid Application for Electron Transport, Proceedings of the 6th International Conference on Computational Science (ICCS'06), Part III, pp.616-623, 2006.

O. Aumage, L. Bougé, J. Méhaut, and R. Namyst, Madeleine II: a portable and efficient communication library for high-performance cluster computing, Proceedings IEEE International Conference on Cluster Computing. CLUSTER 2000, pp.607-626, 2002.
DOI : 10.1109/CLUSTR.2000.889004

A. Avizienis, J. Laprie, and B. Randell, Dependability and Its Threats: A Taxonomy, Building the Information Society, IFIP 18th World Computer Congress, Topical Sessions, pp.22-27, 2004.
DOI : 10.1007/978-1-4020-8157-6_13

A. Avizienis, J. Laprie, B. Randell, and C. E. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, pp.11-33, 2004.
DOI : 10.1109/TDSC.2004.2
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.219.5446

R. M. Badia, D. Du, E. Huedo, A. Kokossis, I. M. Llorente et al., Integration of GRID Superscalar and GridWay Metascheduler with the DRMAA OGF Standard, Proceedings of the 14th European Conference on Parallel and Distributed Computing, pp.445-455, 2008.
DOI : 10.1007/978-3-540-85451-7_49

P. Bar, C. Coti, D. Groen, T. Herault, V. Kravtsov et al., Running Parallel Applications with Topology-Aware Grid Middleware, 2009 Fifth IEEE International Conference on e-Science, 2009.
DOI : 10.1109/e-Science.2009.48
URL : https://hal.archives-ouvertes.fr/hal-00684522

M. Beck, J. Dongarra, and J. S. Plank, NetSolve/D: A Massively Parallel Grid Execution System for Scalable Data Intensive Collaboration, 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
DOI : 10.1109/IPDPS.2005.298

D. Bonachea and J. Duell, Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations, International Journal of High Performance Computing and Networking, vol.1, issue.1/2/3, pp.91-99, 2004.
DOI : 10.1504/IJHPCN.2004.007569

G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fédak, ..., F. Magniette, V. Néri, and A. Selikhov, MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, High Performance Networking and Computing (SC2002), 2002.

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

F. Bouabache, T. Hérault, G. Fedak, and F. Cappello, Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment, Proceedings of the 8th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'08), pp.475-483, 2008.

S.-S. Boutammine, D. Millot, and C. Parrot, An Adaptive Scheduling Method for Grid Computing, Proceedings of the 12th European Conference on Parallel and Distributed Computing, pp.188-197, 2006.
DOI : 10.1007/11823285_20

A. Bouteiller, « Tolérance automatique aux défaillances par points de reprise et retour en arrière dans les systèmes hautes performances à passage de messages ». PhD thesis in computer science, 2006.

A. Bouteiller, G. Bosilca, and J. Dongarra, Redesigning the message logging model for high performance, International Supercomputer Conference, 2008.
DOI : 10.1002/cpe.1589
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.161.3871

A. Bouteiller, F. Cappello, T. Hérault, G. Krawezik, P. Lemarinier, and F. Magniette, MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging, High Performance Networking and Computing (SC2003), Phoenix, USA, 2003.

A. Bouteiller, B. Collin, T. Herault, P. Lemarinier, and F. Cappello, Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI, Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), p.97, 2005.

A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, Coordinated checkpoint versus message log for fault tolerant MPI, IEEE International Conference on Cluster Computing, 2003.

A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, Coordinated checkpoint versus message log for fault tolerant MPI, International Journal of High Performance Computing and Networking (IJHPCN), issue.3, 2004.

S. Branford, C. Sahin, A. Thandavan, C. Weihrauch, and V. N. Alexandrov, Monte Carlo methods for matrix computations on the grid, Future Generation Computer Systems, vol.24, issue.6, pp.605-612, 2008.
DOI : 10.1016/j.future.2007.07.006

J. Bruck, C. Ho, S. Kipnis, E. Upfal, and D. Weathersby, Efficient algorithms for all-to-all communications in multiport message-passing systems, IEEE Transactions on Parallel and Distributed Systems, vol.8, issue.11, pp.1143-1156, 1997.
DOI : 10.1109/71.642949

D. Buntinas, C. Coti, T. Herault, P. Lemarinier, L. Pilard et al., Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols, Future Generation Computer Systems, pp.73-84, 2008.
DOI : 10.1016/j.future.2007.02.002
URL : https://hal.archives-ouvertes.fr/hal-00688644

D. Buntinas, G. Mercier, and W. D. Gropp, Design and Evaluation of Nemesis: a Scalable, Low-Latency, Message-Passing Communication Subsystem, Proceedings of the 6th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006.

G. Burns, R. Daoud, and J. Vaigl, LAM: An Open Cluster Environment for MPI, Proceedings of Supercomputing Symposium, pp.379-386, 1994.

R. M. Butler, W. D. Gropp, and E. L. Lusk, A Scalable Process-Management Environment for Parallel Programs, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 7th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'00), pp.168-175, 2000.
DOI : 10.1007/3-540-45255-9_25
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.7667

N. Capit, G. D. Costa, Y. Georgiou, G. Huard, C. Martin et al., A batch scheduler with high level components, CCGrid 2005: IEEE International Symposium on Cluster Computing and the Grid, pp.776-783, 2005.
DOI : 10.1109/CCGRID.2005.1558641
URL : https://hal.archives-ouvertes.fr/hal-00005106

F. Cappello, E. Caron, M. Dayde, F. Desprez, Y. Jegou et al., Grid'5000: a large scale and highly reconfigurable grid experimental testbed, The 6th IEEE/ACM International Workshop on Grid Computing, pp.99-106, 2005.
DOI : 10.1109/GRID.2005.1542730
URL : https://hal.archives-ouvertes.fr/hal-00684943

F. Cappello, P. Fraigniaud, B. Mans, and A. L. Rosenberg, HiHCoHP: Toward a realistic communication model for hierarchical hyperclusters of heterogeneous processors, Proceedings 15th International Parallel and Distributed Processing Symposium (IPDPS 2001), p.42, 2001.
DOI : 10.1109/IPDPS.2001.924978

D. Caromel, C. Delbe, A. Di Costanzo, and M. Leyton, ProActive: an Integrated platform for programming and running applications on grids and P2P systems, 2006.

H. Casanova and J. Dongarra, NetSolve, Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing '96, pp.212-223, 2000.
DOI : 10.1145/369028.369111

R. H. Castain, T. S. Woodall, D. J. Daniel, J. M. Squyres, B. Barrett et al., The Open Run-Time Environment (OpenRTE): A Transparent Multi-cluster Environment for High-Performance Computing, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 12th European PVM/MPI Users' Group Meeting, pp.225-232, 2005.
DOI : 10.1007/11557265_31

M. Cecchi, F. Capannini, A. Dorigo, A. Ghiselli, F. Giacomini, ..., L. Petronzio, and F. Prelz, The gLite Workload Management System, Proceedings of the 4th International Conference on Advances in Grid and Pervasive Computing, pp.256-268, 2009.

C. Cérin, J. Dubacq, and J. Roch, Methods for Partitioning Data to Improve Parallel Execution Time for Sorting on Heterogeneous Clusters, Lecture Notes in Computer Science, vol.3947, pp.175-186
DOI : 10.1007/11745693_18

K. M. Chandy and L. Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, pp.63-75, 1985.

Q. Chen and M. C. Ferris, FATCOP: A Fault Tolerant Condor-PVM Mixed Integer Programming Solver, SIAM Journal on Optimization, vol.11, issue.4, 2001.
DOI : 10.1137/S1052623499353911

Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun et al., Fault tolerant high performance computing by a coding approach, Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming , PPoPP '05, pp.213-223, 2005.
DOI : 10.1145/1065944.1065973

J. Choi, J. Demmel, I. S. Dhillon, J. Dongarra, S. Ostrouchov et al., ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance, PARA, pp.95-106, 1995.

J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. W. Walker et al., A proposal for a set of parallel basic linear algebra subprograms, pp.107-114, 1995.
DOI : 10.1007/3-540-60902-4_13

B. Claudel, G. Huard, and O. Richard, TakTuk, adaptive deployment of remote executions, Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, pp.91-100, 2009.

C. Coarfa, Y. Dotsenko, J. Mellor-Crummey, D. Chavarria-Miranda, ..., T. El-Ghazawi, A. Mohanti, and Y. Yao, An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C, 2005.

D. Conan, « Tolérance aux fautes par recouvrement arrière dans les systèmes informatiques répartis ». PhD thesis in computer science, 1996.

M. Cooke, « Silicon transistor hits 500GHz performance ». III-Vs Review, pp.30-31, 2006.
DOI : 10.1016/s0961-1290(06)71713-6
URL : http://doi.org/10.1016/s0961-1290(06)71713-6

C. Coti, T. Herault, and F. Cappello, MPI Applications on Grids: A Topology Aware Approach, 2008.
DOI : 10.1007/978-3-540-24685-5_1
URL : https://hal.archives-ouvertes.fr/inria-00319241

C. Coti, T. Herault, and F. Cappello, MPI Applications on Grids: A Topology Aware Approach, Proceedings of the 15th European Conference on Parallel and Distributed Computing (EuroPar'09), pp.466-477, 2009.
DOI : 10.1007/978-3-540-24685-5_1
URL : https://hal.archives-ouvertes.fr/inria-00319241

C. Coti, T. Herault, D. Groen, and M. Mamonski, D1: Adapted Version of the OpenMPI Communication Library, 2009.

C. Coti, T. Herault, P. Lemarinier, S. Peyronnet, and A. Rezmerita, D1.2a: OpenMPI Communication Library, 2007.

C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita et al., Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI, ACM/IEEE SC 2006 Conference (SC'06), 2006.
DOI : 10.1109/SC.2006.15
URL : https://hal.archives-ouvertes.fr/hal-00684891

C. Coti, T. Herault, S. Peyronnet, A. Rezmerita, and F. Cappello, Grid Services For MPI, Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'08), pp.417-424, 2008.

C. Coti, T. Herault, and A. Rezmerita, D1: Adapted Version of the OpenMPI Communication Library, 2008.

C. Coti, A. Rezmerita, T. Herault, and F. Cappello, Grid Services For MPI, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'07), pp.393-394, 2007.

W. Dally and C. L. Seitz, Deadlock-Free Message Routing in Multiprocessor Interconnection Networks, IEEE Transactions on Computers, vol.36, issue.5, pp.547-553, 1987.
DOI : 10.1109/TC.1987.1676939
URL : http://authors.library.caltech.edu/26907/1/5206-TR-86.pdf

V. Danjean, R. Gillard, S. Guelton, J. Roch, and T. Roche, Adaptive loops with kaapi on multicore and grid, Proceedings of the 2007 international workshop on Parallel symbolic computation, PASCO '07, pp.33-42, 2007.
DOI : 10.1145/1278177.1278185

J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, Communication-avoiding parallel and sequential QR factorizations, 2008.
DOI : 10.1137/080731992
URL : http://arxiv.org/abs/0808.2664

A. Denis, O. Aumage, R. F. Hofman, K. Verstoep, T. Kielmann et al., Wide-area communication for grids: an integrated solution to connectivity, performance and security problems, Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, pp.97-106, 2004.
DOI : 10.1109/HPDC.2004.1323501
URL : https://hal.archives-ouvertes.fr/inria-00000126

A. Denis, C. Pérez, and T. Priol, PadicoTM: an open integration framework for communication middleware and runtimes, Future Generation Computer Systems, vol.19, issue.4, pp.575-585, 2003.
DOI : 10.1016/S0167-739X(03)00034-7
URL : https://hal.archives-ouvertes.fr/inria-00000132

S. Dolev, Self Stabilization, Journal of Aerospace Computing, Information, and Communication, vol.1, issue.6, 2000.
DOI : 10.2514/1.10141
URL : https://hal.archives-ouvertes.fr/inria-00627780

E. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

G. E. Fagg and J. Dongarra, FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World, 2000.

G. E. Fagg and J. J. Dongarra, HARNESS fault tolerant MPI design, usage and performance issues, Future Generation Computer Systems, vol.18, issue.8, pp.1127-1142, 2002.

M. J. Flynn, Some Computer Organizations and Their Effectiveness, IEEE Transactions on Computers, vol.21, issue.9, pp.948-960, 1972.

B. Ford, P. Srisuresh, and D. Kegel, Peer-to-Peer Communication Across Network Address Translators, USENIX Annual Technical Conference, General Track (USENIX '05), pp.179-192, 2006.

I. Foster and N. Karonis, A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems, Proceedings of the IEEE/ACM SC98 Conference, 1998.
DOI : 10.1109/SC.1998.10051

I. Foster, C. Kesselman, and S. Tuecke, The Nexus Task-parallel Runtime System, Proc. 1st Intl Workshop on Parallel Processing, pp.457-462, 1994.

I. T. Foster, What is the Grid? A Three Point Checklist, 2002.

I. T. Foster, Globus Toolkit Version 4: Software for Service-Oriented Systems, Journal of Computer Science and Technology, vol.21, issue.4, pp.513-520, 2006.
DOI : 10.1007/s11390-006-0513-y
URL : http://doi.org/10.1007/s11390-006-0513-y

F. Galilée, J. Roch, G. G. H. Cavalheiro, and M. Doreille, Athapascan-1: On-line building data flow graph in a parallel language, Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, pp.88-95, 1998.
DOI : 10.1109/PACT.1998.727176

T. Gautier, X. Besseron, and L. Pigeon, KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors, Proceedings of the International Workshop on Parallel Symbolic Computing (PASCO'07), pp.15-23, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00684843

S. Genaud, A. Giersch, and F. Vivien, Load-balancing scatter operations for grid computing, Parallel Computing, vol.30, issue.8, pp.923-946, 2004.
DOI : 10.1016/j.parco.2004.07.005
URL : https://hal.archives-ouvertes.fr/hal-00807380

S. Genaud and C. Rattanapoka, Fault Management in P2P-MPI, Proceedings of Advances in Grid and Pervasive Computing, Second International Conference, pp.64-77, 2007.
DOI : 10.1007/978-3-540-72360-8_6
URL : https://hal.archives-ouvertes.fr/inria-00529974

Y. Georgiou, J. Leduc, B. Videau, J. Peyrard, and O. Richard, A tool for environment deployment in clusters and light grids, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006.
DOI : 10.1109/IPDPS.2006.1639691
URL : https://hal.archives-ouvertes.fr/hal-00688748

G. H. Golub and C. F. Van Loan, Matrix Computations, 1989.

K. Goto and R. A. van de Geijn, High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software, vol.35, issue.1, 2008.
DOI : 10.1145/1377603.1377607

W. D. Gropp and E. L. Lusk, The MPI communication library: its design and a portable implementation, Proceedings of the Scalable Parallel Libraries Conference, pp.160-165, 1993.

W. D. Gropp and E. L. Lusk, MPICH working note: Creating a new MPICH device using the channel interface, 1995.

W. D. Gropp and E. L. Lusk, Fault Tolerance in MPI Programs, 2004.

W. D. Gropp, E. L. Lusk, N. Doss, and A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, Parallel Computing, vol.22, issue.6, pp.789-828, 1996.
DOI : 10.1016/0167-8191(96)00024-5

M. Grunberg, S. Genaud, and C. Mongenet, « Parallel Seismic Ray Tracing in a Global Earth Model », Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pp.1151-1157, 2002.

M. Gudgin and M. Hadley, « SOAP Version 1.2 Message Normalization ». World Wide Web Consortium, Note NOTE-soap12, 2003.

E. Huedo, R. S. Montero, and I. M. Llorente, The GridWay Framework for Adaptive Scheduling and Execution on Grids, Scalable Computing: Practice and Experience, pp.1-8, 2005.

J. Hursey, T. Mattox, and A. Lumsdaine, Interconnect agnostic checkpoint/restart in open MPI, Proceedings of the 18th ACM international symposium on High performance distributed computing, HPDC '09, pp.49-58, 2009.
DOI : 10.1145/1551609.1551619
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.177.5837

J. Hursey, J. M. Squyres, T. Mattox, and A. Lumsdaine, The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI, 2007 IEEE International Parallel and Distributed Processing Symposium, pp.1-8, 2007.
DOI : 10.1109/IPDPS.2007.370605

P. Husbands, C. Iancu, and K. A. Yelick, A performance analysis of the Berkeley UPC compiler, Proceedings of the 17th annual international conference on Supercomputing , ICS '03, pp.63-73, 2003.
DOI : 10.1145/782814.782825

E. Roman, J. Duell, and P. Hargrove, The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart, 2003.

M. A. Jette, A. B. Yoo, and M. Grondona, SLURM: Simple Linux Utility for Resource Management, Proceedings of the 9th International Workshop on Job Scheduling Strategies for Parallel Processing, pp.44-60, 2003.

L. V. Kalé and S. Krishnan, Charm++: A Portable Concurrent Object Oriented System Based on C++, Proceedings of the International Conference on Object Oriented Programming, Systems, Languages and Applications (OOPSLA'93), pp.91-108, 1993.

N. T. Karonis, B. R. de Supinski, I. Foster, W. D. Gropp et al., Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance, 14th International Parallel and Distributed Processing Symposium (SPDP'2000), pp.377-386, 2000.

N. T. Karonis, B. R. Toonen, and I. T. Foster, MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface, 2002.

V. Kravtsov, D. Carmeli, W. Dubitzky, A. Orda, A. Schuster et al., Quasi-opportunistic Supercomputing in Grid Environments, Algorithms and Architectures for Parallel Processing, 8th International Conference Proceedings, volume 5022 of Lecture Notes in Computer Science, pp.233-244, 2008.
DOI : 10.1007/978-3-540-69501-1_24

K. Kurowski, B. Ludwiczak, J. Nabrzyski, A. Oleksiak, and J. Pukacki, Dynamic Grid Scheduling with Job Migration and Rescheduling in the GridLab Resource Management System, Scientific Programming, pp.263-273, 2004.
DOI : 10.1155/2004/892169

K. Kurowski, M. Mamonski, P. Grabowski, Y. Langlois, G. Mecheneau et al., Second Prototype and Integration of Grid Services Together with QoS-Aware Grid MW Providers, 2008.

S. Lacour, C. Pérez, and T. Priol, A Network Topology Description Model for Grid Application Deployment, Fifth IEEE/ACM International Workshop on Grid Computing, pp.61-68, 2004.
DOI : 10.1109/GRID.2004.2
URL : https://hal.archives-ouvertes.fr/inria-00070773

E. Laure and B. Jones, Enabling Grids for e-Science, 2008.
DOI : 10.1201/9781420067682-c3

P. Lemarinier, A. Bouteiller, T. Herault, G. Krawezik, and F. Cappello, Improved Message logging versus Improved coordinated checkpointing for fault tolerant MPI, IEEE International Conference on Cluster Computing, 2004.

P. Liu and D. Wang, Reduction Optimization in Heterogeneous Cluster Environments, IPPS: 14th International Parallel Processing Symposium, pp.477-482, 2000.

C. Martin, « Déploiement et contrôle d'applications parallèles sur grappes de grandes tailles », 2003.

M. L. Massie, B. N. Chun, and D. E. Culler, The ganglia distributed monitoring system: design, implementation, and experience, Parallel Computing, vol.30, issue.7, pp.817-840, 2004.
DOI : 10.1016/j.parco.2004.04.001
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2889

M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, TCP Adaptation for MPI on Long-and-Fat Networks, Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER'05), pp.1-10, 2005.

P. K. McKinley, Y. Tsai, and D. F. Robinson, Collective communication in wormhole-routed massively parallel computers, IEEE Computer, pp.39-50, 1995.
DOI : 10.1109/2.476198

H. Nakada, S. Matsuoka, K. Seymour, J. Dongarra, C. Lee et al., A Remote Procedure Call API for Grid Computing, Grid Computing - GRID 2002: Third International Workshop, 2002.

H. Nakada, Y. Tanaka, S. Matsuoka, and S. Sekiguchi, The design and implementation of a fault-tolerant RPC system: Ninf-C, Proceedings of the Seventh International Conference on High Performance Computing and Grid in Asia Pacific Region, pp.9-18, 2004.
DOI : 10.1109/HPCASIA.2004.1324011

R. Namyst, « Contribution à la conception de supports exécutifs multithreads performants ». Habilitation à diriger des recherches, 2001.

J. Napper and P. Bientinesi, Can cloud computing reach the top500?, Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, UCHPC-MAW '09, 2009.
DOI : 10.1145/1531666.1531671

A. Petitet, S. Blackford, J. Dongarra, B. Ellis, G. Fagg, K. Roche, and S. Vadhiyar, « Numerical Libraries And The Grid: The GrADS Experiments With ScaLAPACK », 2001.

J. Pjesivac-Grbovic, « Automatic and Adaptive Optimizations of MPI Collective Operations ». PhD thesis in computer science, 2007.
DOI : 10.1109/ipdps.2005.335

J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel et al., Performance Analysis of MPI Collective Operations, 19th IEEE International Parallel and Distributed Processing Symposium, p.272, 2005.
DOI : 10.1109/IPDPS.2005.335

J. S. Plank, K. Li, and M. A. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, pp.972-986, 1998.
DOI : 10.1109/71.730527
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.4662

A. Pothen and P. Raghavan, Distributed Orthogonal Factorization: Givens and Householder Algorithms, SIAM Journal on Scientific and Statistical Computing, vol.10, issue.6, pp.1113-1134, 1989.
DOI : 10.1137/0910067

D. Powell, Failure Mode Assumptions and Assumption Coverage, FTCS, pp.386-395, 1992.
DOI : 10.1007/978-3-642-79789-7_8
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5363

R. Rabenseifner, Automatic MPI Counter Profiling of All Users: First Results on a CRAY T3E 900-512, 1999.

R. Rabenseifner, Optimization of Collective Reduction Operations, Proceedings of the 4th International Conference on Computational Science, pp.1-9, 2004.
DOI : 10.1007/978-3-540-24685-5_1

B. Randell, System structure for software fault tolerance, Proceedings of the international conference on Reliable software, pp.437-449, 1975.
DOI : 10.1007/978-1-4612-6315-9_26
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.578

R. Reddy and A. Lastovetsky, HeteroMPI + ScaLAPACK: Towards a ScaLAPACK (Dense Linear Solvers) on Heterogeneous Networks of Computers, Proceedings of the 13th IEEE International Conference on High Performance Computing, pp.242-253, 2006.

D. A. Reed, C. Lu, and C. L. Mendes, Reliability challenges in large systems, Future Generation Computer Systems, vol.22, issue.3, pp.293-302, 2006.
DOI : 10.1016/j.future.2004.11.015

A. Rezmerita, T. Morlier, V. Néri, and F. Cappello, Private Virtual Cluster: Infrastructure and Protocol for Instant Grids, Proceedings of the 12th European Conference on Parallel and Distributed Computing, pp.393-404, 2006.
DOI : 10.1007/11823285_41

P. C. Roth, D. C. Arnold, and B. P. Miller, MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools, Proceedings of the International Conference for High Performance Networking Computing, Networking, Storage and Analysis (SC|03), 2003.

R. D. Schlichting and F. B. Schneider, Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems, ACM Transactions on Computer Systems, vol.1, pp.222-238, 1983.

S. M. Bellovin, Defending Against Sequence Number Attacks, AT&T Research, RFC 1948, 1996.

R. E. Strom and S. A. Yemini, Optimistic Recovery in Distributed Systems, ACM Transactions on Computer Systems, pp.204-226, 1985.

Sun Microsystems, Inc., RPC: Remote Procedure Call Protocol Specification, Version 2, RFC 1057, 1988.
DOI : 10.17487/rfc1057

V. S. Sunderam, PVM: A framework for parallel distributed computing, Concurrency: Practice and Experience, pp.315-339, 1990.
DOI : 10.1002/cpe.4330020404

R. Takano, M. Matsuda, T. Kudoh, Y. Kodama, F. Okazaki, and Y. Ishikawa, Effects of packet pacing for MPI programs in a Grid environment, CLUSTER, pp.382-391, 2007.

Y. Tanaka, H. Nakada, S. Sekiguchi, T. Suzumura, and S. Matsuoka, « Ninf-G : A Reference Implementation of RPC-based Programming Middleware for Grid Computing », Journal of Grid Computing, vol.1, issue.1, pp.41-51, 2003.
DOI : 10.1023/A:1024083511032

Y. Tanaka, H. Takemiya, H. Nakada, and S. Sekiguchi, Design, Implementation and Performance Evaluation of GridRPC Programming Middleware for a Large-Scale Computational Grid, Fifth IEEE/ACM International Workshop on Grid Computing, 2004.
DOI : 10.1109/GRID.2004.20

T. Tannenbaum, D. Wright, K. Miller, and M. Livny, Condor - A Distributed Job Scheduler, Beowulf Cluster Computing with Linux, 2001.

D. Thain, T. Tannenbaum, and M. Livny, Condor and the Grid, Grid Computing : Making the Global Infrastructure a Reality, 2002.
DOI : 10.1002/0470867167.ch11

R. Thakur and W. D. Gropp, Improving the Performance of Collective Operations in MPICH, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'03), pp.257-267, 2003.
DOI : 10.1007/978-3-540-39924-7_38

P. Tröger, H. Rajic, A. Haas, and P. Domagalski, Standardization of an API for Distributed Resource Management Systems, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), pp.619-626, 2007.
DOI : 10.1109/CCGRID.2007.109

R. C. Whaley, A. Petitet, and J. J. Dongarra, Automated Empirical Optimization of Software and the ATLAS Project, Parallel Computing, vol.27, issue.1-2, pp.3-25, 2001.

A. YarKhan, J. Dongarra, and K. Seymour, GridSolve: The Evolution of Network Enabled Solver, Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments, pp.215-226, 2006.

T. Zhu, Y. Wu, and G. Yang, Scheduling divisible loads in the dynamic heterogeneous grid environment, in Xiaohua Jia (ed.), Proceedings of the 1st International Conference on Scalable Information Systems (Infoscale 2006), ACM International Conference Proceeding Series, vol.152, ACM, 2006.

The role of [...]. The first of these services concerns the life cycle of the application: its deployment, launch, and termination and, during execution, the monitoring of its state and the behavior to adopt in case of failure. The other service provided by the runtime environment is to put the application's processes in contact with one another so that they can communicate through the communication library. Its functionality can therefore be broken down into three categories: deploying and launching the application, communications internal to the runtime environment (collective and point-to-point) [...]

First, [...] scale of the runtime environment itself, through the performance of its main features: application launch and internal communications. Since failures are inevitable in a large-scale system, I then studied fault-tolerance mechanisms [...]

Finally, [...] a particular type of large-scale system, computing grids formed by aggregating clusters, by proposing an MPI communication environment suited to communications on grids in terms of security requirements and built on a runtime environment [...]