The left and right partitions are executed in parallel before their separator elements on the cut.