ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25, 2012. ,
NVIDIA cuDNN GPU accelerated deep learning
Vobla: A vehicle for optimized basic linear algebra, ACM SIGPLAN Notices, vol.49, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01508181
SPIRAL: Code generation for DSP transforms, Proceedings of the IEEE, vol.93, issue.2, 2005. ,
A basic linear algebra compiler, Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2014. ,
LIFT: A functional data-parallel IR for high-performance GPU code generation, Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017. ,
Superoptimizer: a look at the smallest program, ACM SIGARCH Computer Architecture News, vol.15, pp.122-126, 1987. ,
Denali: a goal-directed superoptimizer, vol.37, 2002. ,
Scaling up superoptimization, ACM SIGARCH Computer Architecture News, vol.44, pp.297-310, 2016. ,
Handbook of constraint programming, 2006. ,
Constraint propagation: models, techniques, implementation, 2009. ,
Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models, Proceedings of the IEEE, vol.93, issue.2, pp.276-292, 2005. ,
Algorithmic skeletons: structured management of parallel computation, 1989. ,
Thrust: A productivity-oriented library for CUDA, GPU Computing Gems Jade Edition, pp.359-371, 2011. ,
SkelCL: a portable skeleton library for high-level GPU programming, Parallel and Distributed Processing Workshops and PhD Forum, pp.1176-1182, 2011. ,
Delite: A compiler architecture for performance-oriented embedded domain-specific languages, ACM Transactions on Embedded Computing Systems (TECS), vol.13, issue.4s, p.134, 2014. ,
Copperhead: compiling an embedded data parallel language, ACM SIGPLAN Notices, vol.46, issue.8, pp.47-56, 2011. ,
Accelerating Haskell array codes with multicore GPUs, Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming, pp.3-14, 2011. ,
Nova: A functional language for data parallelism, Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, 2014. ,
Combining analyses, combining optimizations, ACM Transactions on Programming Languages and Systems, vol.17, issue.2, pp.181-196, 1995. ,
Optimization space pruning without regrets, Proceedings of the 26th International Conference on Compiler Construction, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01655602
Equality saturation: a new approach to optimization, ACM SIGPLAN Notices, vol.44, pp.264-276, 2009. ,
Decoupling algorithms from schedules for easy optimization of image processing pipelines, 2012. ,
TVM: end-to-end optimization stack for deep learning, 2018. ,
A framework for unifying reordering transformations, 1998. ,
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies, International Journal of Parallel Programming, vol.34, issue.3, pp.261-317, 2006. ,
Some efficient solutions to the affine scheduling problem. Part I. One-dimensional time, International Journal of Parallel Programming, vol.21, issue.5, pp.313-347, 1992. ,
Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time, International Journal of Parallel Programming, vol.21, pp.389-420, 1992. ,
Chlorophyll: Synthesis-aided compiler for low-power spatial architectures, ACM SIGPLAN Notices, vol.49, pp.396-407, 2014. ,
Combinatorial sketching for finite programs, ACM SIGPLAN Notices, vol.41, issue.11, pp.404-415, 2006. ,
A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, vol.43, pp.101-113, 2008. ,
Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling, Proceedings of the 27th International Conference on Compiler Construction (CC), 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01751823
Iterative optimization in the polyhedral model: Part II, multidimensional time, Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, pp.90-100, 2008. ,
URL : https://hal.archives-ouvertes.fr/hal-01257273
Loop transformations: convexity, pruning and optimization, ACM SIGPLAN Notices, vol.46, pp.549-562, 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00551077
Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, 2018. ,
Scalable parallel programming with CUDA, ACM Queue, vol.6, issue.2, pp.40-53, 2008. ,
OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering, vol.12, issue.1-3, pp.66-73, 2010. ,
OpenMP: An industry-standard API for shared-memory programming, IEEE Comput. Sci. Eng, vol.5, issue.1, pp.46-55, 1998. ,
Automatically Tuned Linear Algebra Software, Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999. ,
Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010. ,
A note on auto-tuning GEMM for GPUs, International Conference on Computational Science, ICCS'09, 2009. ,
Polyhedral parallel code generation for CUDA, ACM Trans. Archit. Code Optim, vol.9, issue.4, 2013. ,
Diesel: DSL for linear algebra and neural net computations on GPUs, Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp.42-51, 2018. ,
Roofline: An insightful visual performance model for multicore architectures, Commun. ACM, 2009. ,
Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp.1-10, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00789958
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, ACM SIGARCH Computer Architecture News, vol.37, pp.152-163, 2009. ,
An adaptive performance modeling tool for GPU architectures, Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), vol.45, pp.105-114, 2010. ,
Adaptive input-aware compilation for graphics engines, ACM SIGPLAN Notices, vol.47, issue.6, pp.13-22, 2012. ,
Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks, 2018. ,
Bandit-based optimization on graphs with application to library performance tuning, 2009. ,
NVIDIA cuBLAS GPU accelerated linear algebra
Estimating search tree size, Proceedings of the 21st National Conference on Artificial Intelligence, vol.2, pp.1014-1019, 2006. ,
Heuristic sampling: A method for predicting the performance of tree searching programs, SIAM Journal on Computing, vol.21, issue.2, pp.295-315, 1992. ,
Estimating the efficiency of backtrack programs, Mathematics of Computation, vol.29, issue.129, pp.121-136, 1975. ,
Optimizing for reduced code space using genetic algorithms, ACM SIGPLAN Notices, vol.34, pp.1-9, 1999. ,
Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2-3, pp.235-256, 2002. ,
Mastering the game of Go without human knowledge, Nature, vol.550, issue.7676, p.354, 2017. ,
The program dependence graph and its use in optimization, ACM Trans. Program. Lang. Syst, vol.9, 1987. ,
The semantics of program dependence, Conference on Programming Language Design and Implementation, PLDI '89, 1989. ,
The program dependence web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages, Conference on Programming Language Design and Implementation, PLDI '90, 1990. ,
From the definition of $f$ and $g$, these properties are already respected for $(f(p_1), g(I_1))$ and for (f
We thus only need to prove that there exists a path and that its internal execution points respect the properties. For that, we consider the different cases of the definition of the edge from
If there exists $i < n$ such that $L = \{d_i\}$, then $I_2(L) = \mathrm{size}_{\min}(\{d_i\}) - 1$. Indeed, $p_2$ is nested inside $d_i$ thus $E_{d_i}$ ? $p_2$. For the same reason, $I_1(\{d_i\}) = \mathrm{size}_{\min}(\{d_i\}) - 1$. Moreover, constraints force dimensions on which point-to-point communication occurs to have the same size. Thus, size min ({d i }) = size min ({d i }). Thus, Existence of a Path from $(E_L, I_2^j)$ to $(p_2, I_2)$, Relative Order of I 1 , I 1 and 2 . Let L ? L
I j 2 (L ) = I 1 (L ) ? I 2 (L )