A. Aggarwal and M. Franklin, An empirical study of the scalability aspects of instruction distribution algorithms for clustered processors, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS., pp.172-179
DOI : 10.1109/ISPASS.2001.990696

A. Agarwal, B. Lim, D. Kranz, and J. Kubiatowicz, April: a processor architecture for multiprocessing, Computer Architecture Proceedings., 17th Annual International Symposium on, pp.104-114, 1990.
DOI : 10.1109/isca.1990.134498

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.2437

G. M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the, pp.483-485, 1967.

B. Tamar, Migrating from sse2 vector operations to avx2 vector operations, 2014.

M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas, Bulldozer: An Approach to Multithreaded Compute Performance, IEEE Micro, vol.31, issue.2, pp.6-15, 2011.
DOI : 10.1109/MM.2011.23

M. Bach, M. Charney, R. Cohn, E. Demikhovsky, T. Devor et al., Analyzing Parallel Programs with Pin, Computer, vol.43, issue.3, pp.34-41, 2010.
DOI : 10.1109/MC.2010.60

N. Brunie, S. Collange, and G. Diamos, Simultaneous branch and warp interweaving for sustained GPU performance
DOI : 10.1145/2366231.2337166

URL : https://hal.archives-ouvertes.fr/ensl-00649650

R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, Dynamically managing the communication-parallelism tradeoff in future clustered processors, ACM SIGARCH Computer Architecture News, pp.49-60, 2003.

J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla et al., A multithreaded PowerPC processor for commercial servers, IBM Journal of Research and Development, vol.44, issue.6, pp.885-898, 2000.

A. Branover, D. Foley, and M. Steinman, AMD Fusion APU: Llano, IEEE Micro, vol.32, issue.2, pp.28-37, 2012.
DOI : 10.1109/MM.2012.2

C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PARSEC benchmark suite, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pp.72-81, 2008.
DOI : 10.1145/1454115.1454128

A. Baniasadi and A. Moshovos, Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors, Microarchitecture, 2000. MICRO-33. Proceedings. 33rd Annual IEEE/ACM International Symposium on, pp.337-347, 2000.
DOI : 10.1109/micro.2000.898083

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.3107

D. R. Budde, R. Riches, M. T. Imel, G. Myers, and K. Lai, Register scoreboarding on a microprocessor chip, US Patent 4,891,753, 1990.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, Workload Characterization IEEE International Symposium on, pp.44-54, 2009.

S. Collange, D. Defour, and Y. Zhang, Dynamic detection of uniform and affine vectors in GPGPU computations, European Conference on Parallel Processing, pp.46-55, 2009.

Intel® Core i7-5960X Processor Extreme Edition.

S. Collange, Stack-less simt reconvergence at low cost, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00622654

R. Canal, J. M. Parcerisa, and A. González, Dynamic cluster assignment mechanisms, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), pp.133-142, 2000.
DOI : 10.1109/HPCA.2000.824345

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.121.8960

F. J. Cazorla, A. Ramírez, M. Valero, and E. Fernández, Dynamically Controlled Resource Allocation in SMT Processors, 37th International Symposium on Microarchitecture (MICRO-37'04), pp.171-182, 2004.
DOI : 10.1109/MICRO.2004.17

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.146.3226

J. D. Collins and D. M. Tullsen, Clustered multithreaded architectures - pursuing both IPC and cycle time, Parallel and Distributed Processing Symposium Proceedings. 18th International, p.76, 2004.

M. Dechene, E. Forbes, and E. Rotenberg, Multithreaded instruction sharing, 2010.

G. Diamos, A. Kerr, H. Wu, S. Yalamanchili, B. Ashbaugh, and S. Maiyuran, SIMD reconvergence at thread frontiers, MICRO 44: Proceedings of the 44th annual IEEE/ACM International Symposium on Microarchitecture, 2011.

L. Dagum and R. Menon, OpenMP: an industry standard API for shared-memory programming, IEEE Computational Science and Engineering, vol.5, issue.1, pp.46-55, 1998.
DOI : 10.1109/99.660313

R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, Design of ion-implanted MOSFET's with very small physical dimensions, IEEE Journal of Solid-State Circuits, vol.9, issue.5, pp.256-268, 1974.

R. Dolbeau and A. Seznec, Cash: Revisiting hardware sharing in single-chip parallel processor, 2002.
URL : https://hal.archives-ouvertes.fr/inria-00071925

A. El-moursy and D. H. Albonesi, Front-end policies for improved issue efficiency in SMT processors, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pp.31-40, 2003.
DOI : 10.1109/HPCA.2003.1183522

H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, Looking back on the language and hardware revolutions, ACM SIGARCH Computer Architecture News, vol.39, issue.1, pp.319-332, 2011.
DOI : 10.1145/1961295.1950402

S. Eyerman and L. Eeckhout, A memory-level parallelism aware fetch policy for SMT processors, 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), pp.240-249, 2007.

R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel et al., Evaluation of multithreaded uniprocessors for commercial application environments, ACM SIGARCH Computer Architecture News, vol.24, pp.203-212, 1996.

N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo, Intel AVX: New frontiers in performance improvements and energy efficiency, 2008.

M. J. Flynn, Very high-speed computing systems, Proceedings of the IEEE, vol.54, issue.12, pp.1901-1909, 1966.

M. J. Flynn, Some computer organizations and their effectiveness, IEEE Transactions on Computers, vol.C-21, issue.9, pp.948-960, 1972.

I. Foster, Designing and building parallel programs, 1995.

W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, Dynamic warp formation and scheduling for efficient GPU control flow, Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pp.407-420, 2007.

W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, ACM Trans. Archit. Code Optim., vol.6, issue.2, pp.7:1-7:37, 2009.

J. González, Q. Cai, P. Chaparro, G. Magklis, R. Rakvic et al., Thread fusion, Proceeding of the thirteenth international symposium on Low power electronics and design, ISLPED '08, pp.363-368, 2008.
DOI : 10.1145/1393921.1394018

P. Greenhalgh, Big.LITTLE processing with ARM Cortex-A15 & Cortex-A7, pp.1-8, 2011.

S. Hily and A. Seznec, Branch prediction and simultaneous multithreading, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique, pp.169-173, 1996.
DOI : 10.1109/PACT.1996.552664

URL : https://hal.archives-ouvertes.fr/inria-00073847

S. Hily and A. Seznec, Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading, Proceedings Fifth International Symposium on High-Performance Computer Architecture, pp.64-67, 1999.
DOI : 10.1109/HPCA.1999.744331

URL : https://hal.archives-ouvertes.fr/inria-00073298

Intel® Intrinsics Guide, https://software.intel.com/sites/landingpage/IntrinsicsGuid, accessed 2016-2017.

P. Kongetira, K. Aingaran, and K. Olukotun, Niagara: A 32-Way Multithreaded Sparc Processor, IEEE Micro, vol.25, issue.2, pp.21-29, 2005.
DOI : 10.1109/MM.2005.35

S. Kalathingal and S. Collange, Dynamic Inter-Thread Vectorization Architecture: Extracting DLP from TLP, 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2016.
DOI : 10.1109/SBAC-PAD.2016.11

URL : https://hal.archives-ouvertes.fr/hal-01356202

M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero et al., Extending OpenMP* with Vector Constructs for Modern Multicore SIMD Architectures, International Workshop on OpenMP, pp.59-72, 2012.
DOI : 10.1007/978-3-642-30961-8_5

R. E. Kessler, The Alpha 21264 microprocessor, IEEE Micro, vol.19, issue.2, pp.24-36, 1999.

B. C. Kuszmaul, D. S. Henry, and G. H. Loh, A comparison of scalable superscalar processors, Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, pp.126-137, 1999.

R. Kumar, N. P. Jouppi, and D. M. Tullsen, Conjoined-core chip multiprocessing, Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, pp.195-206, 2004.
DOI : 10.1109/micro.2004.12

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.7776

R. Keryell and N. Paris, Activity Counter: New Optimization for the dynamic scheduling of SIMD Control Flow, 1993 International Conference on Parallel Processing, ICPP'93 Vol2, pp.184-187, 1993.
DOI : 10.1109/ICPP.1993.36

V. Krishnan and J. Torrellas, A clustered approach to multithreaded processors, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pp.627-634, 1998.
DOI : 10.1109/IPPS.1998.669992

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.3093

R. Kumar, D. M. Tullsen, N. P. Jouppi, and P. Ranganathan, Heterogeneous chip multiprocessors, Computer, vol.38, issue.11, pp.32-38, 2005.
DOI : 10.1109/MC.2005.379

D. B. Kirk and W. W. Hwu, Programming massively parallel processors: a hands-on approach, 2012.

S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman et al., McPAT, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Micro-42, pp.469-480, 2009.
DOI : 10.1145/1669112.1669172

C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser et al., Pin, ACM SIGPLAN Notices, vol.40, issue.6, pp.190-200, 2005.
DOI : 10.1145/1064978.1065034

R. B. Lee, Multimedia extensions for general-purpose processors, 1997 IEEE Workshop on Signal Processing Systems. SiPS 97 Design and Implementation formerly VLSI Signal Processing
DOI : 10.1109/SIPS.1997.625683

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.7303

G. Long, D. Franklin, S. Biswas, P. Ortiz, J. Oberg et al., Minimal multithreading: Finding and removing redundant instructions in multithreaded processors, Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp.337-348, 2010.
DOI : 10.1109/micro.2010.41

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.295.9883

K. Luo, M. Franklin, S. S. Mukherjee, and A. Seznec, Boosting SMT performance by speculation control, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, p.2, 2001.
DOI : 10.1109/IPDPS.2001.924929

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8887

K. Luo, J. Gummaraju, and M. Franklin, Balancing throughput and fairness in SMT processors, IEEE International Symposium on Performance Analysis of Systems and Software, pp.164-171, 2001.

A. Lashgar, A. Khonsari, and A. Baniasadi, HARP, ACM Transactions on Embedded Computing Systems, vol.13, issue.3s, p.114, 2014.
DOI : 10.1007/s02011-011-1137-8

E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, vol.28, issue.2, pp.39-55, 2008.
DOI : 10.1109/MM.2008.31

A. Levinthal and T. Porter, Chap - a SIMD graphics processor, ACM SIGGRAPH Computer Graphics, vol.18, issue.3, pp.77-82, 1984.
DOI : 10.1145/964965.808581

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.210.5888

R. A. Lorie and H. R. Strong Jr., Method for conditional branch execution in SIMD vector processors, US Patent 4,435,758, 1984.

G. E. Moore, Cramming more components onto integrated circuits, Proceedings of the IEEE, pp.82-85, 1998.

C. McNairy and R. Bhatia, Montecito: A dual-core, dual-thread Itanium processor, IEEE Micro, vol.25, issue.2, pp.10-20, 2005.

M. Mckeown, J. Balkind, and D. Wentzlaff, Execution Drafting: Energy Efficiency through Computation Deduplication, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp.432-444, 2014.
DOI : 10.1109/MICRO.2014.43

T. Milanez, S. Collange, F. M. Q. Pereira, W. Meira et al., Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads, Parallel Computing, vol.40, issue.9, pp.548-558, 2014.
DOI : 10.1016/j.parco.2014.03.006

URL : https://hal.archives-ouvertes.fr/hal-01087054

J. Menon, M. de Kruijf, and K. Sankaralingam, iGPU, ACM SIGARCH Computer Architecture News, pp.72-83, 2012.
DOI : 10.1145/2366231.2337168

S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, An Evaluation of Vectorizing Compilers, 2011 International Conference on Parallel Architectures and Compilation Techniques, pp.372-382, 2011.
DOI : 10.1109/PACT.2011.68

D. Miles, B. Leback, and D. Norton, Optimizing application performance on x64 processor-based systems with PGI compilers and tools, The Portland Group, 2007.

M. Moudgill, K. Pingali, and S. Vassiliadis, Register renaming and dynamic speculation: an alternative approach, Proceedings of the 26th Annual International Symposium on Microarchitecture, pp.202-213, 1993.
DOI : 10.1109/MICRO.1993.282756

J. Meng and K. Skadron, Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling, 2009 IEEE International Conference on Computer Design, pp.282-288, 2009.
DOI : 10.1109/ICCD.2009.5413143

J. Meng, D. Tarjan, and K. Skadron, Dynamic warp subdivision for integrated branch and memory divergence tolerance, ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.235-246, 2010.
DOI : 10.1145/1816038.1815992

A. Munshi, The OpenCL specification, 2009 IEEE Hot Chips 21 Symposium (HCS), pp.1-314, 2009.
DOI : 10.1109/HOTCHIPS.2009.7478342

D. Naishlos, Autovectorization in gcc, Proceedings of the 2004 GCC Developers Summit, pp.105-118, 2004.

J. Nickolls, I. Buck, M. Garland, and K. Skadron, Scalable parallel programming with CUDA, Queue, vol.6, issue.2, pp.40-53, 2008.
DOI : 10.1145/1365490.1365500

J. Nickolls and W. J. Dally, The GPU Computing Era, IEEE Micro, vol.30, issue.2, pp.56-69, 2010.
DOI : 10.1109/MM.2010.41

NVIDIA, Compute unified device architecture programming guide, 2007.

D. Nuzman and A. Zaks, Autovectorization in GCC - two years later, Proceedings of the 2006 GCC Developers Summit, pp.145-158, 2006.

M. O'Connor, Highlights of the high-bandwidth memory (HBM) standard, Memory Forum Workshop, 2014.

K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, The case for a single-chip multiprocessor, ACM SIGPLAN Notices, issue.9, pp.312-323, 1996.

D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee et al., The design and implementation of a first-generation Cell processor - a multi-core SoC, Integrated Circuit Design and Technology, 2005. ICICDT 2005. 2005 International Conference on, pp.49-52, 2005.

P. Greenhalgh, Big.LITTLE processing with ARM Cortex-A15 & Cortex-A7, 2013.

D. A. Patterson and J. L. Hennessy, Computer organization and design: the hardware/software interface, 2013.

S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-effective superscalar processors, 1997.
DOI : 10.1145/384286.264201

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.527.5571

M. Pharr and W. R. Mark, ispc: A SPMD compiler for high-performance CPU programming, Innovative Parallel Computing (InPar), 2012, pp.1-13, 2012.

D. A. Padua and M. J. Wolfe, Advanced compiler optimizations for supercomputers, Commun. ACM, vol.29, issue.12, pp.1184-1201, 1986.

I. Parulkar, A. Wood, J. C. Hoe, B. Falsafi et al., OpenSPARC: An open platform for hardware reliability experimentation, Fourth Workshop on Silicon Errors in Logic-System Effects (SELSE), 2008.

M. J. Quinn, P. J. Hatcher, and K. C. Jourdenais, Compiling C* programs for a hypercube multicomputer, ACM SIGPLAN Notices, vol.23, pp.57-65, 1988.

J. Reinders, Intel® AVX-512 instructions, 2013.

J. Reinders, Additional Intel® AVX-512 instructions, 2014.

S. Rixner, W. J. Dally, B. Khailany, P. Mattson et al., Register organization for media processing, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), pp.375-386, 2000.
DOI : 10.1109/HPCA.2000.824366

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.7602

R. M. Russell, The CRAY-1 computer system, Communications of the ACM, vol.21, issue.1, pp.63-72, 1978.
DOI : 10.1145/359327.359336

M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski et al., UltraSPARC T2: A highly-threaded, power-efficient, SPARC SoC, Solid-State Circuits Conference ASSCC'07. IEEE Asian, pp.22-25, 2007.

R. Saavedra-Barrera, D. Culler, and T. von Eicken, Analysis of multithreaded architectures for parallel computing, Proceedings of the second annual ACM symposium on Parallel algorithms and architectures, SPAA '90, pp.169-178, 1990.
DOI : 10.1145/97444.97683

A. Seznec, A new case for the TAGE branch predictor, Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pp.117-127, 2011.
DOI : 10.1145/2155620.2155635

URL : https://hal.archives-ouvertes.fr/hal-00639193

A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, Design tradeoffs for the alpha EV8 conditional branch predictor, 29th International Symposium on Computer Architecture, pp.25-29, 2002.

J. E. Stone, D. Gohara, and G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering, vol.12, issue.1-3, pp.66-73, 2010.

S. K. Raman, V. Pentkovski, and J. Keshava, Implementing streaming SIMD extensions on the Pentium III processor, 2000.

B. J. Smith, Architecture and applications of the HEP multiprocessor computer system, 25th Annual Technical Symposium, pp.241-248, 1982.

Y. Takahashi, A mechanism for SIMD execution of SPMD programs, Proceedings High Performance Computing on the Information Superhighway. HPC Asia '97, pp.529-534, 1997.
DOI : 10.1109/HPC.1997.592203

R. Thekkath and S. J. Eggers, The effectiveness of multiple hardware contexts, ACM SIGPLAN Notices, vol.29, issue.11, pp.328-337, 1994.
DOI : 10.1145/195470.195583

D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor, Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp.191-202, 1996.

D. M. Tullsen, S. J. Eggers, J. S. Emer et al., Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor, ACM SIGARCH Computer Architecture News, vol.24, pp.191-202, 1996.

D. M. Tullsen, S. J. Eggers, and H. M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, ACM SIGARCH Computer Architecture News, vol.23, pp.392-403, 1995.

R. M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM Journal of Research and Development, vol.11, issue.1, pp.25-33, 1967.

N. Tuck and D. M. Tullsen, Initial observations of the simultaneous multithreading Pentium 4 processor, 12th International Conference on Parallel Architectures and Compilation Techniques (PACT 2003), pp.26-34, 2003.
DOI : 10.1109/PACT.2003.1237999

M. Upton, Hyper-threading technology architecture and microarchitecture, Intel Technology Journal Q, vol.1, 2002.

S. L. Xi, H. M. Jacobson, P. Bose, G.-Y. Wei, and D. M. Brooks, Quantifying sources of error in McPAT and potential impacts on architectural studies, 21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, pp.577-589, 2015.
