W. Gropp, E. Lusk, N. Doss, and A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, Parallel Computing, vol.22, issue.6, pp.789-828, 1996.
DOI : 10.1016/0167-8191(96)00024-5

W. Gropp, E. Lusk, and A. Skjellum, Using MPI: portable parallel programming with the message-passing interface, 1999.

G. E. Blelloch and B. M. Maggs, Parallel algorithms, in Algorithms and theory of computation handbook, pp.25-25, 2010.

D. Grünewald and C. Simmendinger, The GASPI API specification and its implementation GPI 2.0, 7th International Conference on PGAS Programming Models, 2013.

C. Simmendinger, M. Rahn, and D. Gruenewald, The GASPI API: A failure tolerant PGAS API for asynchronous dataflow on heterogeneous architectures, in Sustained Simulation Performance 2014, pp.17-32, 2015.

C. Simmendinger, J. Jägersküpper, R. Machado, and C. Lojewski, A PGAS-based implementation for the unstructured CFD solver TAU, 2011.

R. Chandra, Parallel programming in OpenMP, 2001.

G. J. Gorman, J. Southern, P. E. Farrell, M. D. Piggott, G. Rokos et al., Hybrid OpenMP/MPI Anisotropic Mesh Smoothing, Procedia Computer Science, vol.9, pp.1513-1522, 2012.
DOI : 10.1016/j.procs.2012.04.166

URL : https://doi.org/10.1016/j.procs.2012.04.166

D. Schmidl, T. Cramer, S. Wienke, C. Terboven, and M. S. Müller, Assessing the Performance of OpenMP Programs on the Intel Xeon Phi, Proceedings of the 19th International Conference on Parallel Processing, ser. Euro-Par'13, pp.547-558, 2013.
DOI : 10.1007/978-3-642-40047-6_56

X. Guo, M. Lange, G. Gorman, L. Mitchell, and M. Weiland, Developing a scalable hybrid MPI/OpenMP unstructured finite element model, Computers & Fluids, vol.110, pp.227-234, 2015.
DOI : 10.1016/j.compfluid.2014.09.007

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall et al., Cilk: An Efficient Multithreaded Runtime System, Journal of Parallel and Distributed Computing, vol.37, issue.1, pp.55-69, 1996.
DOI : 10.1006/jpdc.1996.0107

URL : http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-548.pdf

M. Frigo, C. E. Leiserson, and K. H. Randall, The implementation of the Cilk-5 multithreaded language, ACM SIGPLAN Notices, pp.212-223, 1998.

C. E. Leiserson, The Cilk++ concurrency platform, The Journal of Supercomputing, vol.8, issue.2, pp.244-257, 2010.
URL : http://dspace.mit.edu/openaccess-disseminate/1721.1/59828/

J. Reinders, Intel threading building blocks: outfitting C++ for multi-core processor parallelism, 2007.

C. Cecka, A. J. Lew, and E. Darve, Assembly of finite element methods on graphics processors, International Journal for Numerical Methods in Engineering, vol.17, issue.2, pp.640-669, 2011.
DOI : 10.1007/978-3-540-75444-2_37

G. Markall, A. Slemmer, D. Ham, P. Kelly, C. Cantwell et al., Finite element assembly strategies on multi-core and many-core architectures, International Journal for Numerical Methods in Fluids, vol.1, issue.1, pp.80-97, 2013.
DOI : 10.1002/fld.3648

C. Farhat and L. Crivelli, A general approach to nonlinear FE computations on shared-memory multiprocessors, Computer Methods in Applied Mechanics and Engineering, vol.72, issue.2, pp.153-171, 1989.
DOI : 10.1016/0045-7825(89)90157-6

J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, Sparse matrix solvers on the GPU, ACM Transactions on Graphics, vol.22, issue.3, pp.917-924, 2003.
DOI : 10.1145/882262.882364

D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, The directory-based cache coherence protocol for the dash multiprocessor, Proceedings of the 17th Annual International Symposium on Computer Architecture, ser. ISCA '90, pp.148-159, 1990.

J. H. Kelm, M. R. Johnson, S. S. Lumetta, and S. J. Patel, WAYPOINT, Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT '10, pp.99-110, 2010.
DOI : 10.1145/1854273.1854291

L. Thébault, E. Petit, M. Tchiboukdjian, Q. Dinh, and W. Jalby, Divide and conquer parallelization of finite element method assembly, Advances in Parallel Computing 25, 2014.

E. Petit, L. Thébault, N. Möller, Q. Dinh, and W. Jalby, Task-Based Parallelization of Unstructured Meshes Assembly Using D&C Strategy, 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), pp.874-877, 2014.
DOI : 10.1109/HPCC.2014.150

L. Thébault, E. Petit, Q. Dinh, and W. Jalby, Scalable and efficient implementation of 3D unstructured meshes computation: A case study on matrix assembly, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015.

N. Möller, E. Petit, L. Thébault, and Q. Dinh, A Case Study on Using a Proto-Application as a Proxy for Code Modernization, Procedia Computer Science, vol.51, pp.1433-1442, 2015.
DOI : 10.1016/j.procs.2015.05.333

D. Anderson, F. Sparacio, and R. M. Tomasulo, The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling, IBM Journal of Research and Development, vol.11, issue.1, pp.8-24, 1967.
DOI : 10.1147/rd.111.0008

J. R. Goodman, Using cache memory to reduce processor-memory traffic, ACM SIGARCH Computer Architecture News, vol.11, issue.3, pp.124-131, 1983.
DOI : 10.1145/1067651.801647

A. Seznec, The L-TAGE branch predictor, Journal of Instruction Level Parallelism, Citeseer, 2006.

A. Agarwal, B. Lim, D. Kranz, and J. Kubiatowicz, APRIL: a processor architecture for multiprocessing, 1990.
DOI : 10.21236/ada237476

URL : http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-450.pdf

D. Geer, Chip makers turn to multicore processors, ACM SIGPLAN Notices, pp.11-13, 2002.
DOI : 10.1109/MC.2005.160

J. Lira, C. Molina, and A. González, Analysis of non-uniform cache architecture policies for chip-multiprocessors using the PARSEC benchmark suite, Proceedings of the workshop on managed many-core systems, pp.1-8, 2009.

W. Arden, M. Brillouët, P. Cogez, M. Graef, B. Huizing et al., More-than-Moore white paper, p.14, 2010.

L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash et al., Larrabee: a many-core x86 architecture for visual computing, ACM Transactions on Graphics (TOG), vol.27, issue.3, p.18, 2008.

L. Chen, P. Jiang, and G. Agrawal, Exploiting recent SIMD architectural advances for irregular applications, Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, 2016.
DOI : 10.1145/2442516.2442523

URL : http://dl.acm.org/ft_gateway.cfm?id=2854046&type=pdf

E. Saule, K. Kaya, and Ü. V. Çatalyürek, Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi, Parallel Processing and Applied Mathematics, pp.559-570, 2013.
DOI : 10.1007/978-3-642-55224-3_52

J. J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK Benchmark: past, present and future, Concurrency and Computation: practice and experience, pp.803-820, 2003.
DOI : 10.1137/1.9780898719642

J. Dongarra and M. A. Heroux, Toward a new metric for ranking high performance computing systems, Sandia Report, vol.312, pp.2013-4744, 2013.

C. A. Patterson, M. Snir, and S. L. Graham, Getting Up to Speed: The Future of Supercomputing, 2005.

J. Nickolls, I. Buck, M. Garland, and K. Skadron, Scalable parallel programming with CUDA, pp.40-53, 2008.
DOI : 10.1145/1401132.1401152

J. E. Stone, D. Gohara, and G. Shi, OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems, Computing in Science & Engineering, vol.12, issue.3, pp.66-73, 2010.
DOI : 10.1109/MCSE.2010.69

URL : http://europepmc.org/articles/pmc2964860?pdf=render

E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra et al., Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.97-104, 2004.
DOI : 10.1007/978-3-540-30218-6_19

H. Tang and T. Yang, Optimizing threaded MPI execution on SMP clusters, Proceedings of the 15th international conference on Supercomputing, ICS '01, pp.381-392, 2001.
DOI : 10.1145/377792.377895

URL : http://www.cs.ucsb.edu/~tyang/papers/ics01.ps

C. Huang, O. Lawlor, and L. V. Kale, Adaptive MPI, Languages and Compilers for Parallel Computing, pp.306-322, 2003.
DOI : 10.1007/978-3-540-24644-2_20

M. Pérache, P. Carribault, and H. Jourdren, MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.94-103, 2009.
DOI : 10.1007/3-540-27039-6_19

F. O'Carroll, H. Tezuka, A. Hori, and Y. Ishikawa, The design and implementation of zero copy MPI using commodity hardware with a high performance network, Proceedings of the 12th international conference on Supercomputing, pp.243-250, 1998.

M. J. Koop, S. Sur, and D. K. Panda, Zero-copy protocol for MPI using InfiniBand unreliable datagram, 2007 IEEE International Conference on Cluster Computing, pp.179-186, 2007.
DOI : 10.1109/CLUSTR.2007.4629230

URL : http://www.cse.ohio-state.edu/~koop/pub/koop-cluster07.pdf

W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced features of the message-passing interface, 1999.

R. Thakur, W. Gropp, and B. Toonen, Optimizing the Synchronization Operations in Message Passing Interface One-Sided Communication, The International Journal of High Performance Computing Applications, vol.19, issue.2, pp.119-128, 2005.
DOI : 10.1109/SC.2000.10023

R. Gerstenberger, M. Besta, and T. Hoefler, Enabling highly-scalable remote memory access programming with MPI-3 one sided, 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp.1-12, 2013.
DOI : 10.1155/2014/571902

URL : https://doi.org/10.1155/2014/571902

T. Hoefler, J. Dinan, R. Thakur, B. Barrett, P. Balaji et al., Remote Memory Access Programming in MPI-3, ACM Transactions on Parallel Computing, vol.2, issue.2, p.9, 2015.
DOI : 10.1145/2555243.2555270

R. Belli and T. Hoefler, Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, 2015 IEEE International Parallel and Distributed Processing Symposium, pp.871-881, 2015.
DOI : 10.1109/IPDPS.2015.30

URL : http://htor.inf.ethz.ch/publications/img/notified-access-extending-rma.pdf

I. Fraunhofer, GPI - Global address space programming interface, 2013.

W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks et al., Introduction to UPC and language specification, Center for Computing Sciences, Institute for Defense Analyses, 1999.

A. Aiken, P. Colella, D. Gay, S. Graham, P. Hilfinger et al., Titanium: A high-performance Java dialect, Concurrency: Practice and Experience, pp.11-13, 1998.

R. W. Numrich and J. Reid, Co-array Fortran for parallel programming, ACM SIGPLAN Fortran Forum, pp.1-31, 1998.
DOI : 10.1145/289918.289920

URL : http://caf.rice.edu/documentation/nrRAL98060.pdf

B. L. Chamberlain, D. Callahan, and H. P. Zima, Parallel Programmability and the Chapel Language, The International Journal of High Performance Computing Applications, vol.8, issue.3, pp.291-312, 2007.
DOI : 10.1002/(SICI)1096-9128(199809/11)10:11/13<825::AID-CPE383>3.0.CO;2-H

URL : http://www.cs.utexas.edu/%7Elin/cs380p/chapel07.pdf

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra et al., X10, ACM SIGPLAN Notices, vol.40, issue.10, pp.519-538, 2005.
DOI : 10.1145/1103845.1094852

URL : https://hal.archives-ouvertes.fr/in2p3-00166974

D. Bonachea and J. Duell, Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations, International Journal of High Performance Computing and Networking, vol.1, issue.1/2/3, pp.91-99, 2004.
DOI : 10.1504/IJHPCN.2004.007569

C. Bell, D. Bonachea, R. Nishtala, and K. Yelick, Optimizing bandwidth limited problems using one-sided communication and overlap, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, p.10, 2006.
DOI : 10.1109/IPDPS.2006.1639320

URL : http://http.cs.berkeley.edu/~bonachea/upc/upc_bisection_IPDPS06.pdf

P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm, Parallel Computing, vol.40, issue.7, pp.224-238, 2014.
DOI : 10.1016/j.parco.2013.06.001

F. Shahzad, M. Wittmann, M. Kreutzer, T. Zeiser, G. Hager et al., PGAS implementation of SpMVM and LBM using GPI, 7th International Conference on PGAS Programming Models, p.172, 2013.

R. Machado, S. Abreu, and D. Diaz, Parallel local search: Experiments with a PGAS-based programming model, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00735787

P. Jarzebski, K. Wisniewski, and R. Taylor, On parallelization of the loop over elements in FEAP, Computational Mechanics, vol.90, issue.2, pp.77-86, 2015.
DOI : 10.1002/nme.3335

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00550877

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault et al., PaRSEC: Exploiting Heterogeneity to Enhance Scalability, Computing in Science & Engineering, vol.15, issue.6, pp.36-45, 2013.
DOI : 10.1109/MCSE.2013.98

C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault, StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators, 2012.
DOI : 10.1007/978-3-642-33518-1_40

URL : https://hal.archives-ouvertes.fr/hal-00992208

O. Delannoy, F. Emad, and S. Petiton, Workflow Global Computing with YML, 2006 7th IEEE/ACM International Conference on Grid Computing, pp.25-32, 2006.
DOI : 10.1109/ICGRID.2006.310994

URL : https://hal.archives-ouvertes.fr/hal-00141650

A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell et al., OmpSs: A Proposal for Programming Heterogeneous Multi-Core Architectures, Parallel Processing Letters, vol.30, issue.02, pp.173-193, 2011.
DOI : 10.1016/j.jcp.2004.10.011

P. Gonnet, A. B. Chalk, and M. Schaller, Quicksched: Task-based parallelism with dependencies and conflicts, 2016.

R. Blikberg and T. Sørevik, Nested Parallelism: Allocation of Threads to Tasks and OpenMP Implementation, Scientific Programming, pp.185-194, 2001.
DOI : 10.1155/2001/821575

S. Shah, G. Haab, P. Petersen, and J. Throop, Flexible control structures for parallelism in OpenMP, Concurrency: Practice and Experience, pp.1219-1239, 2000.
DOI : 10.1109/TC.1987.5009478

E. Su, X. Tian, M. Girkar, G. Haab, S. Shah et al., Compiler support of the workqueuing execution model for Intel SMP architectures, Fourth European Workshop on OpenMP, 2002.

E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin et al., The Design of OpenMP Tasks, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.3, pp.404-418, 2009.
DOI : 10.1109/TPDS.2008.105

E. Ayguadé, A. Duran, J. Hoeflinger, F. Massaioli, and X. Teruel, An Experimental Evaluation of the New OpenMP Tasking Model, Languages and Compilers for Parallel Computing, pp.63-77, 2007.
DOI : 10.1007/978-3-540-85261-2_5

A. Duran, J. Corbalán, and E. Ayguadé, Evaluation of OpenMP task scheduling strategies, in OpenMP in a new era of parallelism, pp.100-110, 2008.

S. L. Olivier and J. F. Prins, Evaluating OpenMP 3.0 run time systems on unbalanced task graphs, in Evolving OpenMP in an Age of Extreme Parallelism, pp.63-78, 2009.

S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan et al., UTS: An Unbalanced Tree Search Benchmark, Languages and Compilers for Parallel Computing, pp.235-250, 2006.
DOI : 10.1007/978-3-540-72521-3_18

URL : http://people.eecs.ku.edu/~jhuan/papers/lcpc06.pdf

M. Frigo, P. Halpern, C. E. Leiserson, and S. Lewin-Berlin, Reducers and other Cilk++ hyperobjects, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, SPAA '09, pp.79-90, 2009.
DOI : 10.1145/1583991.1584017

R. D. Blumofe and C. E. Leiserson, Scheduling multithreaded computations by work stealing, Journal of the ACM, vol.46, issue.5, pp.720-748, 1999.
DOI : 10.1145/324133.324234

URL : http://csdl.computer.org/comp/proceedings/sfcs/1994/6580/00/0365680.pdf

J. T. Fineman and C. E. Leiserson, Race detectors for Cilk and Cilk++ programs, Encyclopedia of Parallel Computing, pp.1706-1719, 2011.

Y. He, C. E. Leiserson, and W. M. Leiserson, The Cilkview scalability analyzer, Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, SPAA '10, pp.145-156, 2010.
DOI : 10.1145/1810479.1810509

URL : http://www.csd.uwo.ca/~moreno/CS433-CS9624/Resources/p145-he.pdf

T. B. Schardl, B. C. Kuszmaul, I. Lee, W. M. Leiserson, and C. E. Leiserson, The Cilkprof Scalability Profiler, Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA '15, pp.89-100, 2015.
DOI : 10.1145/1594835.1504210

URL : http://dspace.mit.edu/bitstream/1721.1/113050/1/Leiserson_The%20cilkprof.pdf

C. Luk, R. Newton, W. Hasenplaugh, M. Hampton, and G. Lowney, A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops, IEEE Software, vol.28, issue.1, p.39, 2011.
DOI : 10.1109/MS.2011.2

Y. Saad, Iterative methods for sparse linear systems, SIAM, 2003.

A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert et al., Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pp.233-244, 2009.

G. M. Morton, A computer oriented geodetic data base and a new technique in file sequencing, International Business Machines Company, 1966.

A. Buluç, S. Williams, L. Oliker, and J. Demmel, Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication, Parallel & Distributed Processing Symposium (IPDPS), pp.721-733, 2011.

R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, When cache blocking of sparse matrix vector multiply works and why, Applicable Algebra in Engineering, Communication and Computing, vol.18, issue.3, pp.297-311, 2007.
DOI : 10.1007/s00200-007-0038-9

URL : http://bebop.cs.berkeley.edu/pubs/nishtala2007-cb-spmv.pdf

E. Im, K. Yelick, and R. Vuduc, Sparsity: Optimization Framework for Sparse Matrix Kernels, The International Journal of High Performance Computing Applications, vol.18, issue.1, pp.135-158, 2004.
DOI : 10.1007/BF01388687

URL : http://jsbach.kookmin.ac.kr/ejim/papers/ijhpca04.pdf

M. Martone, S. Filippone, S. Tucci, M. Paprzycki, and M. Ganzha, Utilizing recursive storage in sparse matrix-vector multiplication-preliminary considerations, CATA, pp.300-305, 2010.

M. Martone, S. Filippone, M. Paprzycki, and S. Tucci, On BLAS Operations with Recursively Stored Sparse Matrices, 2010 12th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp.49-56, 2010.
DOI : 10.1109/SYNASC.2010.72

M. Martone, S. Filippone, P. Gepner, M. Paprzycki, and S. Tucci, Use of hybrid recursive CSR/COO data structures in sparse matrices-vector multiplication, International Multiconference on Computer Science and Information Technology (IMCSIT), pp.327-335, 2010.

M. Martone, S. Filippone, S. Tucci, and M. Paprzycki, Assembling recursively stored sparse matrices, Proceedings of the International Multiconference on Computer Science and Information Technology, pp.317-325, 2010.
DOI : 10.1109/IMCSIT.2010.5680036

URL : http://www.proceedings2010.imcsit.org/pliks/205.pdf

K. Kourtis, G. Goumas, and N. Koziris, Optimizing sparse matrix-vector multiplication using index and value compression, Proceedings of the 2008 conference on Computing frontiers, CF '08, pp.87-96, 2008.
DOI : 10.1145/1366230.1366244

URL : http://www.cslab.ece.ntua.gr/~nkoziris/papers/cf08-spmv-kkourt.pdf

M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SIAM Journal on Scientific Computing, vol.36, issue.5, pp.401-423, 2014.
DOI : 10.1137/130930352

J. R. Rice and R. F. Boisvert, Solving elliptic problems using ELLPACK, 2012.
DOI : 10.1007/978-1-4612-5018-0

H. C. Edwards, D. Sunderland, V. Porter, C. Amsler, and S. Mish, Manycore performance-portability: Kokkos multidimensional array library, Scientific Programming, pp.89-114, 2012.
DOI : 10.1155/2012/917630

URL : http://doi.org/10.1155/2012/917630

M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu et al., An overview of the Trilinos project, ACM Transactions on Mathematical Software, vol.31, issue.3, pp.397-423, 2005.
DOI : 10.1145/1089014.1089021

M. T. Heath and P. Raghavan, A Cartesian Parallel Nested Dissection Algorithm, SIAM Journal on Matrix Analysis and Applications, vol.16, issue.1, pp.235-253, 1995.
DOI : 10.1137/S0895479892238270

G. L. Miller, S. Teng, and S. A. Vavasis, A unified geometric approach to graph separators, [1991] Proceedings 32nd Annual Symposium of Foundations of Computer Science, pp.538-547, 1991.
DOI : 10.1109/SFCS.1991.185417

M. J. Berger and S. H. Bokhari, A Partitioning Strategy for Nonuniform Problems on Multiprocessors, IEEE Transactions on Computers, vol.36, issue.5, pp.570-580, 1987.
DOI : 10.1109/TC.1987.1676942

R. D. Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations, Concurrency: Practice and experience, pp.457-481, 1991.
DOI : 10.1007/978-1-4613-1627-5

A. Ytterström, A Tool for Partitioning Structured Multiblock Meshes for Parallel Computational Mechanics, The International Journal of Supercomputer Applications and High Performance Computing, vol.20, issue.4, pp.336-343, 1997.
DOI : 10.1007/BF01933580

C. Lachat, C. Dobrzynski, and F. Pellegrini, Parallel mesh adaptation using parallel graph partitioning, 5th European Conference on Computational Mechanics (ECCM V), pp.2612-2623, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01099259

C. Chevalier and F. Pellegrini, PT-Scotch: A tool for efficient parallel graph ordering, Parallel Computing, vol.34, issue.6-8, pp.318-331, 2008.
DOI : 10.1016/j.parco.2007.12.001

URL : https://hal.archives-ouvertes.fr/hal-00402893

C. Dobrzynski and P. Frey, Anisotropic Delaunay Mesh Adaptation for Unsteady Simulations, Proceedings of the 17th international Meshing Roundtable, pp.177-194, 2008.
DOI : 10.1007/978-3-540-87921-3_11

URL : https://hal.archives-ouvertes.fr/hal-00353786

H. D. Simon, Partitioning of unstructured problems for parallel processing, Computing Systems in Engineering, vol.2, issue.2-3, pp.135-148, 1991.
DOI : 10.1016/0956-0521(91)90014-V

S. T. Barnard and H. D. Simon, Fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, Concurrency: Practice and experience, pp.101-117, 1994.

G. Karypis and V. Kumar, Parallel multilevel k-way partitioning scheme for irregular graphs, department of computer science, pp.96-132, 1996.

G. Karypis and V. Kumar, METIS - Unstructured graph partitioning and sparse matrix ordering system, version 2.0, 1995.

G. Karypis and V. Kumar, Parallel multilevel k-way partitioning scheme for irregular graphs, Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing '96, pp.96-129, 1998.
DOI : 10.1145/369028.369103

URL : http://www.cs.umn.edu/~kumar/papers/mlevel_kparallel.ps

G. Karypis and V. Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal on Scientific Computing, vol.20, issue.1, pp.359-392, 1998.
DOI : 10.1137/S1064827595287997

URL : http://glaros.dtc.umn.edu/gkhome/fetch/papers/mlSIAMSC99.pdf

B. Hendrickson and R. Leland, A multilevel algorithm for partitioning graphs, Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing '95, p.28, 1995.
DOI : 10.1145/224170.224228

T. N. Bui and C. Jones, A heuristic for reducing fill-in in sparse matrix factorization, Society for Industrial and Applied Mathematics (SIAM), 1993.

B. W. Kernighan and S. Lin, An Efficient Heuristic Procedure for Partitioning Graphs, Bell System Technical Journal, vol.49, issue.2, pp.291-307, 1970.
DOI : 10.1002/j.1538-7305.1970.tb01770.x

D. Lasalle and G. Karypis, Multi-threaded Graph Partitioning, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp.225-236, 2013.
DOI : 10.1109/IPDPS.2013.50

G. Karypis, K. Schloegel, and V. Kumar, ParMETIS: Parallel graph partitioning and sparse matrix ordering library, Version, vol.2, 2003.
DOI : 10.1006/jpdc.1997.1403

URL : http://www.cs.umn.edu/~kumar/papers/mlevel_parallel.ps

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, Parallel Computing, vol.35, issue.3, pp.178-194, 2009.
DOI : 10.1016/j.parco.2008.12.006

N. Gourdain, L. Gicquel, M. Montagnac, O. Vermorel, M. Gazaix et al., High performance parallel computing of flows in complex geometries: I. Methods, Computational Science & Discovery, vol.2, issue.1, p.15003, 2009.
DOI : 10.1088/1749-4699/2/1/015003

L. Qu, L. Grigori, and F. Nataf, Parallel design and performance of nested filtering factorization preconditioner, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13, pp.811-8112, 2013.

J. Park, M. Smelyanskiy, K. Vaidyanathan, A. Heinecke, D. D. Kalamkar et al., Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.945-955, 2014.
DOI : 10.1109/SC.2014.82

M. Garcia, Analysis of precision differences observed for the AVBP code, Tech. Rep, 2003.

A. H. Gebremedhin, D. Nguyen, M. M. Patwary, and A. Pothen, ColPack, ACM Transactions on Mathematical Software, vol.40, issue.1, p.1, 2013.
DOI : 10.1145/2513109.2513110

G. Rokos, G. J. Gorman, K. E. Jensen, and P. H. Kelly, Thread parallelism for highly irregular computation in anisotropic mesh adaptation, Proceedings of the 3rd International Conference on Exascale Applications and Software, pp.103-108, 2015.

A. H. Gebremedhin and F. Manne, Scalable parallel graph coloring algorithms, Concurrency - Practice and Experience, pp.1131-1146, 2000.
DOI : 10.1006/jpdc.1996.0117

URL : http://www.cs.odu.edu/~assefaw/pub/cpe-color.ps

Ü. V. Çatalyürek, J. Feo, A. H. Gebremedhin, M. Halappanavar, and A. Pothen, Graph coloring algorithms for multi-core and massively multithreaded architectures, Parallel Computing, vol.38, issue.10-11, pp.576-594, 2012.
DOI : 10.1016/j.parco.2012.07.001

G. Rokos, G. Gorman, and P. H. Kelly, A fast and scalable graph coloring algorithm for multicore and many-core architectures, Euro-Par 2015: Parallel Processing, pp.414-425, 2015.
DOI : 10.1007/978-3-662-48096-0_32

URL : http://arxiv.org/pdf/1505.04086

E. Horowitz and A. Zorat, Divide-and-Conquer for Parallel Processing, IEEE Transactions on Computers, vol.32, issue.6, pp.582-585, 1983.
DOI : 10.1109/TC.1983.1676280

C. A. Hoare, Quicksort, The Computer Journal, vol.5, issue.1, pp.10-16, 1962.
DOI : 10.1093/comjnl/5.1.10

R. C. Singleton, An algorithm for computing the mixed radix fast Fourier transform, IEEE Transactions on Audio and Electroacoustics, vol.17, issue.2, pp.93-103, 1969.
DOI : 10.1109/TAU.1969.1162042

M. Tchiboukdjian, V. Danjean, and B. Raffin, Binary Mesh Partitioning for Cache-Efficient Visualization, IEEE Transactions on Visualization and Computer Graphics, vol.16, issue.5, pp.815-828, 2010.
DOI : 10.1109/TVCG.2010.19

URL : https://hal.archives-ouvertes.fr/hal-00685930

M. Tchiboukdjian, V. Danjean, and B. Raffin, Cache-efficient parallel isosurface extraction for shared cache multicores, EGPGV, pp.81-90, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00798445

M. Tchiboukdjian, V. Danjean, T. Gautier, F. L. Mentec, and B. Raffin, A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores, Highly Parallel Processing on a Chip (HPPC), 2010.
DOI : 10.1145/1693453.1693482

T. Guillet and M. Tchiboukdjian, Scalable and composable shared memory parallelism with tasks for multicore and manycore, Exascale Challenges Workshop, TERATEC Forum, 2012.

J. Dongarra, V. Eijkhout, and P. Luszczek, Recursive Approach in Sparse Matrix LU Factorization, Scientific Programming, vol.9, issue.1, pp.51-60, 2001.
DOI : 10.1155/2001/569670

URL : https://doi.org/10.1155/2001/569670

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, issue.1, pp.38-53, 2009.
DOI : 10.1016/j.parco.2008.10.002

M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, Cache-oblivious algorithms, 40th Annual Symposium on Foundations of Computer Science, IEEE, pp.285-297, 1999.
DOI : 10.1109/sffcs.1999.814600

J. D. Frens and D. S. Wise, Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code, ACM SIGPLAN Notices, pp.206-216, 1997.
DOI : 10.1145/263767.263789

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi, Recursive array layouts and fast matrix multiplication, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.11, pp.1105-1123, 2002.
DOI : 10.1109/tpds.2002.1058095

URL : http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA440384&Location=U2&doc=GetTRDoc.pdf

E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström, Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software, SIAM Review, vol.46, issue.1, pp.3-45, 2004.
DOI : 10.1137/S0036144503428693

C. Vömel, S. Tomov, and J. Dongarra, Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems, SIAM Journal on Scientific Computing, vol.34, issue.2, pp.70-82, 2012.
DOI : 10.1137/100806783

E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak et al., Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, Journal of Physics: Conference Series, p.12037, 2009.
DOI : 10.1088/1742-6596/180/1/012037

URL : http://iopscience.iop.org/article/10.1088/1742-6596/180/1/012037/pdf

D. Goudin and J. Roman, A Scalable Parallel Assembly for Irregular Meshes Based on a Block Distribution for a Parallel Block Direct Solver, Applied Parallel Computing. New Paradigms for HPC in Industry and Academia, pp.113-120, 2000.
DOI : 10.1007/3-540-70734-4_15

I. Wu and H. Kung, Communication complexity for parallel divide-and-conquer, [1991] Proceedings 32nd Annual Symposium of Foundations of Computer Science, pp.151-162, 1991.
DOI : 10.1109/SFCS.1991.185364

URL : http://www.cs.cmu.edu/afs/cs/project/cmcl/archive/Nectar-papers/91focs.ps

D. A. Padua and M. J. Wolfe, Advanced compiler optimizations for supercomputers, pp.1184-1201, 1986.

S. Maleki, Y. Gao, M. J. Garzaran et al., An evaluation of vectorizing compilers, 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT), IEEE, 2011.

I. Z. Reguly, E. László, G. R. Mudalige, and M. B. Giles, Vectorizing unstructured mesh computations for many-core architectures, Concurrency and Computation: Practice and Experience, 2015.
DOI : 10.1002/cpe.3621

URL : http://www.oerc.ox.ac.uk/sites/default/files/uploads/profile-pages/Gihan/p39-reguly.pdf

G. Mudalige, M. Giles, I. Reguly, C. Bertolli, and P. Kelly, OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures, 2012 Innovative Parallel Computing (InPar), pp.1-12, 2012.
DOI : 10.1109/InPar.2012.6339594

D. A. Burgess, P. I. Crumpton, and M. B. Giles, A Parallel Framework for Unstructured Grid Solvers, 1994.
DOI : 10.1007/978-3-0348-8534-8_10

URL : http://www-sccm.stanford.edu/~burgess/papers/OPlus.ps.gz

P. I. Crumpton and M. B. Giles, Multigrid aircraft computations using the OPlus parallel library, in Parallel Computational Fluid Dynamics: Implementation and Results using Parallel Computers, Proceedings Parallel CFD, vol.95, pp.339-346, 1996.

R. Löhner, Cache-efficient renumbering for vectorization, International Journal for Numerical Methods in Biomedical Engineering, vol.26, issue.5, pp.628-636, 2010.

E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, Proceedings of the 1969 24th national conference, pp.157-172, 1969.
DOI : 10.1145/800195.805928

L. Oliker, X. Li, P. Husbands, and R. Biswas, Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations, SIAM Review, vol.44, issue.3, pp.373-393, 2002.
DOI : 10.1137/S00361445003820

L. Oliker, X. Li, G. Heber, and R. Biswas, Parallel conjugate gradient: effects of ordering strategies, programming paradigms, and architectural platforms, 2000.

G. Heber, R. Biswas, and G. R. Gao, Self-avoiding walks over adaptive unstructured grids, Concurrency: Practice and Experience, pp.85-109, 2000.
DOI : 10.1007/bfb0097981

A. Yzelman and R. H. Bisseling, Cache-Oblivious Sparse Matrix-Vector Multiplication by Using Sparse Matrix Partitioning Methods, SIAM Journal on Scientific Computing, vol.31, issue.4, pp.3128-3154, 2009.
DOI : 10.1137/080733243

URL : http://www.math.uu.nl/people/bisseling/Mondriaan/yzelman09.pdf

H. Sagan, Space-filling curves, 2012.
DOI : 10.1007/978-1-4612-0871-6

A. Yzelman and R. H. Bisseling, Two-dimensional cache-oblivious sparse matrix-vector multiplication, Parallel Computing, vol.37, issue.12, pp.806-819, 2011.
DOI : 10.1016/j.parco.2011.08.004

A. George, Nested Dissection of a Regular Finite Element Mesh, SIAM Journal on Numerical Analysis, vol.10, issue.2, pp.345-363, 1973.
DOI : 10.1137/0710032

I. Brainman and S. Toledo, Nested-Dissection Orderings for Sparse LU with Partial Pivoting, SIAM Journal on Matrix Analysis and Applications, vol.23, issue.4, pp.998-1012, 2002.
DOI : 10.1137/S0895479801385037

URL : http://www.math.tau.ac.il/~sivan/Pubs/wide.pdf

P. Cicotti, L. Carrington, and A. Chien, Toward application-specific memory reconfiguration for energy efficiency, Proceedings of the 1st International Workshop on Energy Efficient Supercomputing, E2SC '13, p.2, 2013.
DOI : 10.1145/2536430.2536434

M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards et al., Improving performance via mini-applications, Sandia National Laboratories, vol.3, 2009.

R. F. Barrett, S. Borkar, S. S. Dosanjh, S. D. Hammond, M. A. Heroux et al., On the role of co-design in high performance computing, pp.141-155, 2013.

S. L. Olivier and J. F. Prins, Comparison of OpenMP 3.0 and Other Task Parallel Frameworks on Unbalanced Task Graphs, International Journal of Parallel Programming, vol.11, issue.1, pp.5-6, 2010.
DOI : 10.1007/s10766-010-0140-7

P. Langlois, D. Parello, B. Goossens, and K. Porada, Less hazardous and more scientific research for summation algorithm computing times, 2012.
URL : https://hal.archives-ouvertes.fr/lirmm-00737617

S. Collange, M. Daumas, D. Defour, and D. Parello, Barra: A Parallel Functional Simulator for GPGPU, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp.351-360, 2010.
DOI : 10.1109/MASCOTS.2010.43