A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25, 2012.

NVIDIA cuDNN: GPU accelerated deep learning

U. Beaugnon, A. Kravets, S. van Haastregt, R. Baghdadi, D. Tweed et al., VOBLA: A vehicle for optimized basic linear algebra, ACM SIGPLAN Notices, vol.49, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01508181

M. Püschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso et al., SPIRAL: Code generation for DSP transforms, Proceedings of the IEEE, vol.93, issue.2, 2005.

D. G. Spampinato and M. Püschel, A basic linear algebra compiler, Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2014.

M. Steuwer, T. Remmelg, and C. Dubach, LIFT: A functional data-parallel IR for high-performance GPU code generation, Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017.

H. Massalin, Superoptimizer: a look at the smallest program, ACM SIGARCH Computer Architecture News, vol.15, pp.122-126, 1987.

R. Joshi, G. Nelson, and K. Randall, Denali: a goal-directed superoptimizer, ACM SIGPLAN Notices, vol.37, 2002.

P. M. Phothilimthana, A. Thakur, R. Bodik, and D. Dhurjati, Scaling up superoptimization, ACM SIGARCH Computer Architecture News, vol.44, pp.297-310, 2016.

F. Rossi, P. van Beek, and T. Walsh, Handbook of constraint programming, 2006.

G. Tack, Constraint propagation: models, techniques, implementation, 2009.

G. Baumgartner, A. Auer, D. E. Bernholdt, A. Bibireata, V. Choppella et al., Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models, Proceedings of the IEEE, vol.93, issue.2, pp.276-292, 2005.

M. I. Cole, Algorithmic skeletons: structured management of parallel computation, 1989.

N. Bell and J. Hoberock, Thrust: A productivity-oriented library for CUDA, GPU Computing Gems Jade Edition, pp.359-371, 2011.

M. Steuwer, P. Kegel, and S. Gorlatch, SkelCL: a portable skeleton library for high-level GPU programming, Parallel and Distributed Processing Workshops and PhD Forum, pp.1176-1182, 2011.

A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi et al., Delite: A compiler architecture for performance-oriented embedded domain-specific languages, ACM Transactions on Embedded Computing Systems (TECS), vol.13, issue.4s, p.134, 2014.

B. Catanzaro, M. Garland, and K. Keutzer, Copperhead: compiling an embedded data parallel language, ACM SIGPLAN Notices, vol.46, issue.8, pp.47-56, 2011.

M. M. T. Chakravarty, G. Keller, S. Lee, T. L. McDonell, and V. Grover, Accelerating Haskell array codes with multicore GPUs, Proceedings of the sixth workshop on Declarative aspects of multicore programming, pp.3-14, 2011.

A. Collins, D. Grewe, V. Grover, S. Lee, and A. Susnea, Nova: A functional language for data parallelism, Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, 2014.

C. Click and K. D. Cooper, Combining analyses, combining optimizations, ACM Transactions on Programming Languages and Systems, vol.17, issue.2, pp.181-196, 1995.

U. Beaugnon, A. Pouille, M. Pouzet, J. Pienaar, and A. Cohen, Optimization space pruning without regrets, Proceedings of the 26th International Conference on Compiler Construction, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01655602

R. Tate, M. Stepp, Z. Tatlock, and S. Lerner, Equality saturation: a new approach to optimization, ACM SIGPLAN Notices, vol.44, pp.264-276, 2009.

J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe et al., Decoupling algorithms from schedules for easy optimization of image processing pipelines, 2012.

T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan et al., TVM: end-to-end optimization stack for deep learning, 2018.

W. Kelly and W. Pugh, A framework for unifying reordering transformations, 1998.

S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello et al., Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies, International Journal of Parallel Programming, vol.34, issue.3, pp.261-317, 2006.

P. Feautrier, Some efficient solutions to the affine scheduling problem. Part I. One-dimensional time, International Journal of Parallel Programming, vol.21, issue.5, pp.313-347, 1992.

P. Feautrier, Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time, International Journal of Parallel Programming, vol.21, pp.389-420, 1992.

P. M. Phothilimthana, T. Jelvis, R. Shah, N. Totla, S. Chasins et al., Chlorophyll: Synthesis-aided compiler for low-power spatial architectures, ACM SIGPLAN Notices, vol.49, pp.396-407, 2014.

A. Solar-Lezama, L. Tancau, R. Bodik, S. Seshia, and V. Saraswat, Combinatorial sketching for finite programs, ACM SIGPLAN Notices, vol.41, issue.11, pp.404-415, 2006.

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, vol.43, pp.101-113, 2008.

O. Zinenko, S. Verdoolaege, C. Reddy, J. Shirako, T. Grosser et al., Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling, Proceedings of the 27th International Conference on Compiler Construction (CC), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01751823

L. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos, Iterative optimization in the polyhedral model: Part II, multidimensional time, Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, pp.90-100, 2008.
URL : https://hal.archives-ouvertes.fr/hal-01257273

L. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam et al., Loop transformations: convexity, pruning and optimization, ACM SIGPLAN Notices, vol.46, pp.549-562, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00551077

N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. Devito et al., Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, 2018.

J. Nickolls, I. Buck, M. Garland, and K. Skadron, Scalable parallel programming with CUDA, ACM Queue, vol.6, issue.2, pp.40-53, 2008.

J. E. Stone, D. Gohara, and G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering, vol.12, issue.1-3, pp.66-73, 2010.

L. Dagum and R. Menon, OpenMP: An industry-standard API for shared-memory programming, IEEE Comput. Sci. Eng, vol.5, issue.1, pp.46-55, 1998.

R. C. Whaley and J. Dongarra, Automatically Tuned Linear Algebra Software, Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999.

S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010.

Y. Li, J. Dongarra, and S. Tomov, A note on auto-tuning GEMM for GPUs, International Conference on Computational Science, ICCS'09, 2009.

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor, Polyhedral parallel code generation for CUDA, ACM Trans. Archit. Code Optim, vol.9, issue.4, 2013.

V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and V. Grover, Diesel: DSL for linear algebra and neural net computations on GPUs, Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp.42-51, 2018.

S. Williams, A. Waterman, and D. Patterson, Roofline: An insightful visual performance model for multicore architectures, Commun. ACM, 2009.

J. Lai and A. Seznec, Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp.1-10, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00789958

S. Hong and H. Kim, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, ACM SIGARCH Computer Architecture News, vol.37, pp.152-163, 2009.

S. S. Baghsorkhi, M. Delahaye, S. J. Patel et al., An adaptive performance modeling tool for GPU architectures, Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), vol.45, pp.105-114, 2010.

M. Samadi, A. Hormati, M. Mehrara, J. Lee, and S. Mahlke, Adaptive input-aware compilation for graphics engines, ACM SIGPLAN Notices, vol.47, issue.6, pp.13-22, 2012.

C. Mendis, S. Amarasinghe, and M. Carbin, Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks, 2018.

F. de Mesmay, A. Rimmel, Y. Voronenko, and M. Püschel, Bandit-based optimization on graphs with application to library performance tuning, 2009.

NVIDIA cuBLAS: GPU accelerated linear algebra

P. Kilby, J. Slaney, S. Thiébaux, and T. Walsh, Estimating search tree size, Proceedings of the 21st National Conference on Artificial Intelligence, vol.2, pp.1014-1019, 2006.

P. C. Chen, Heuristic sampling: A method for predicting the performance of tree searching programs, SIAM Journal on Computing, vol.21, issue.2, pp.295-315, 1992.

D. E. Knuth, Estimating the efficiency of backtrack programs, Mathematics of Computation, vol.29, issue.129, pp.121-136, 1975.

K. D. Cooper, P. J. Schielke, and D. Subramanian, Optimizing for reduced code space using genetic algorithms, ACM SIGPLAN Notices, vol.34, pp.1-9, 1999.

P. Auer, N. Cesa-bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine learning, vol.47, issue.2-3, pp.235-256, 2002.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang et al., Mastering the game of Go without human knowledge, Nature, vol.550, issue.7676, p.354, 2017.

J. Ferrante, K. J. Ottenstein, and J. D. Warren, The program dependence graph and its use in optimization, ACM Trans. Program. Lang. Syst, vol.9, 1987.

R. Cartwright and M. Felleisen, The semantics of program dependence, Conference on Programming Language Design and Implementation, PLDI '89, 1989.

K. J. Ottenstein, R. A. Ballance, and A. B. Maccabe, The program dependence web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages, Conference on Programming Language Design and Implementation, PLDI '90, 1990.

From the definition of $f$ and $g$, these properties are already respected for $(f(p_1), g(I_1))$ and for $(f$

We thus only need to prove that there exists a path and that its internal execution points respect the properties. To do so, we consider the different cases of the definition of the edge from

Then,

If there exists $i < n$ such that $L = \{d_i\}$, then $I_2(L) = \mathrm{size}_{\min}(\{d_i\}) - 1$. Indeed, $p_2$ is nested inside $d_i$, thus $E_{d_i}$ ? $p_2$. For the same reason, $I_1(\{d_i\}) = \mathrm{size}_{\min}(\{d_i\}) - 1$. Moreover, constraints force dimensions on which point-to-point communication occurs to have the same size. Thus, $\mathrm{size}_{\min}(\{d_i\}) = \mathrm{size}_{\min}(\{d_i\})$. Thus, Existence of a Path from $(E_L, I_2^j)$ to $(p_2, I_2)$, Relative Order of $I_1$, $I_1$ and 2. Let $L$ ? $L$

?. Otherwise, $I_2^j(L) = I_1(L)$ ? $I_2(L)$