An empirical study of the scalability aspects of instruction distribution algorithms for clustered processors, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS., pp.172-179 ,
DOI : 10.1109/ISPASS.2001.990696
APRIL: a processor architecture for multiprocessing, Proceedings of the 17th Annual International Symposium on Computer Architecture, pp.104-114, 1990. ,
DOI : 10.1109/isca.1990.134498
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.2437
Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the, pp.483-485, 1967. ,
Migrating from SSE2 vector operations to AVX2 vector operations, 2014. ,
Bulldozer: An Approach to Multithreaded Compute Performance, IEEE Micro, vol.31, issue.2, pp.6-15, 2011. ,
DOI : 10.1109/MM.2011.23
Analyzing Parallel Programs with Pin, Computer, vol.43, issue.3, pp.34-41, 2010. ,
DOI : 10.1109/MC.2010.60
Simultaneous branch and warp interweaving for sustained GPU performance ,
DOI : 10.1145/2366231.2337166
URL : https://hal.archives-ouvertes.fr/ensl-00649650
Dynamically managing the communication-parallelism tradeoff in future clustered processors, ACM SIGARCH Computer Architecture News, pp.49-60, 2003. ,
A multithreaded PowerPC processor for commercial servers, IBM Journal of Research and Development, vol.44, issue.6, pp.885-898, 2000. ,
AMD Fusion APU: Llano, IEEE Micro, vol.32, issue.2, pp.28-37, 2012. ,
DOI : 10.1109/MM.2012.2
The PARSEC benchmark suite, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pp.72-81, 2008. ,
DOI : 10.1145/1454115.1454128
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors, Microarchitecture, 2000. MICRO-33. Proceedings. 33rd Annual IEEE/ACM International Symposium on, pp.337-347, 2000. ,
DOI : 10.1109/micro.2000.898083
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.3107
Register scoreboarding on a microprocessor chip, US Patent 4,891,753, 1990. ,
Rodinia: A benchmark suite for heterogeneous computing, Workload Characterization IEEE International Symposium on, pp.44-54, 2009. ,
Dynamic detection of uniform and affine vectors in GPGPU computations, European Conference on Parallel Processing, pp.46-55, 2009. ,
Intel® Core™ i7-5960X Processor Extreme Edition ,
Stack-less SIMT reconvergence at low cost, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-00622654
Dynamic cluster assignment mechanisms, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), pp.133-142, 2000. ,
DOI : 10.1109/HPCA.2000.824345
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.121.8960
Dynamically Controlled Resource Allocation in SMT Processors, 37th International Symposium on Microarchitecture (MICRO-37'04), pp.171-182, 2004. ,
DOI : 10.1109/MICRO.2004.17
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.146.3226
Clustered multithreaded architectures: pursuing both IPC and cycle time, Parallel and Distributed Processing Symposium Proceedings. 18th International, p.76, 2004. ,
Multithreaded instruction sharing, 2010. ,
SIMD reconvergence at thread frontiers, MICRO 44: Proceedings of the 44th annual IEEE/ACM International Symposium on Microarchitecture, 2011. ,
OpenMP: an industry standard API for shared-memory programming, IEEE Computational Science and Engineering, vol.5, issue.1, pp.46-55, 1998. ,
DOI : 10.1109/99.660313
Design of ion-implanted MOSFET's with very small physical dimensions, IEEE Journal of Solid-State Circuits, vol.9, issue.5, pp.256-268, 1974. ,
Cash: Revisiting hardware sharing in single-chip parallel processor, 2002. ,
URL : https://hal.archives-ouvertes.fr/inria-00071925
Front-end policies for improved issue efficiency in SMT processors, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pp.31-40, 2003. ,
DOI : 10.1109/HPCA.2003.1183522
Looking back on the language and hardware revolutions, ACM SIGARCH Computer Architecture News, vol.39, issue.1, pp.319-332, 2011. ,
DOI : 10.1145/1961295.1950402
A memory-level parallelism aware fetch policy for SMT processors, 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), pp.240-249, 2007. ,
Evaluation of multithreaded uniprocessors for commercial application environments, In ACM SIGARCH Computer Architecture News, vol.24, pp.203-212, 1996. ,
Intel AVX: New frontiers in performance improvements and energy efficiency, 2008. ,
Very high-speed computing systems, Proceedings of the IEEE, vol.54, issue.12, pp.1901-1909, 1966. ,
Some computer organizations and their effectiveness, IEEE Transactions on Computers, vol.C-21, issue.9, pp.948-960, 1972. ,
Designing and building parallel programs, 1995. ,
Dynamic warp formation and scheduling for efficient GPU control flow, Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pp.407-420, 2007. ,
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, ACM Trans. Archit. Code Optim, vol.67, pp.1-7, 2009. ,
Thread fusion, Proceeding of the thirteenth international symposium on Low power electronics and design, ISLPED '08, pp.363-368, 2008. ,
DOI : 10.1145/1393921.1394018
big.LITTLE processing with ARM Cortex-A15 & Cortex-A7, pp.1-8, 2011. ,
Branch prediction and simultaneous multithreading, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique, pp.169-173, 1996. ,
DOI : 10.1109/PACT.1996.552664
URL : https://hal.archives-ouvertes.fr/inria-00073847
Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading, Proceedings Fifth International Symposium on High-Performance Computer Architecture, pp.64-67, 1999. ,
DOI : 10.1109/HPCA.1999.744331
URL : https://hal.archives-ouvertes.fr/inria-00073298
https://software.intel.com/sites/landingpage/IntrinsicsGuid Accessed, pp.2016-2017 ,
Niagara: A 32-Way Multithreaded Sparc Processor, IEEE Micro, vol.25, issue.2, pp.21-29, 2005. ,
DOI : 10.1109/MM.2005.35
Dynamic Inter-Thread Vectorization Architecture: Extracting DLP from TLP, 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2016. ,
DOI : 10.1109/SBAC-PAD.2016.11
URL : https://hal.archives-ouvertes.fr/hal-01356202
Extending OpenMP* with Vector Constructs for Modern Multicore SIMD Architectures, International Workshop on OpenMP, pp.59-72, 2012. ,
DOI : 10.1007/978-3-642-30961-8_5
The Alpha 21264 microprocessor, IEEE Micro, vol.19, issue.2, pp.24-36, 1999. ,
A comparison of scalable superscalar processors, Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, pp.126-137, 1999. ,
Conjoined-core chip multiprocessing, Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, pp.195-206, 2004. ,
DOI : 10.1109/micro.2004.12
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.7776
Activity Counter: New Optimization for the dynamic scheduling of SIMD Control Flow, 1993 International Conference on Parallel Processing, ICPP'93 Vol2, pp.184-187, 1993. ,
DOI : 10.1109/ICPP.1993.36
A clustered approach to multithreaded processors, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pp.627-634, 1998. ,
DOI : 10.1109/IPPS.1998.669992
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.3093
Heterogeneous chip multiprocessors, Computer, vol.38, issue.11, pp.32-38, 2005. ,
DOI : 10.1109/MC.2005.379
Programming massively parallel processors: a hands-on approach, 2012. ,
McPAT, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Micro-42, pp.469-480, 2009. ,
DOI : 10.1145/1669112.1669172
Pin, ACM SIGPLAN Notices, vol.40, issue.6, pp.190-200, 2005. ,
DOI : 10.1145/1064978.1065034
Multimedia extensions for general-purpose processors, 1997 IEEE Workshop on Signal Processing Systems. SiPS 97 Design and Implementation formerly VLSI Signal Processing ,
DOI : 10.1109/SIPS.1997.625683
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.7303
Minimal multithreading: Finding and removing redundant instructions in multithreaded processors, Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp.337-348, 2010. ,
DOI : 10.1109/micro.2010.41
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.295.9883
Boosting SMT performance by speculation control, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, p.2, 2001. ,
DOI : 10.1109/IPDPS.2001.924929
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8887
Balancing throughput and fairness in SMT processors, IEEE International Symposium on Performance Analysis of Systems and Software, pp.164-171, 2001. ,
HARP, ACM Transactions on Embedded Computing Systems, vol.13, issue.3s, p.114, 2014. ,
DOI : 10.1007/s02011-011-1137-8
NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, vol.28, issue.2, pp.39-55, 2008. ,
DOI : 10.1109/MM.2008.31
Chap - a SIMD graphics processor, ACM SIGGRAPH Computer Graphics, vol.18, issue.3, pp.77-82, 1984. ,
DOI : 10.1145/964965.808581
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.210.5888
Method for conditional branch execution in simd vector processors, US Patent, vol.4435, p.758, 1984. ,
Cramming more components onto integrated circuits, Proceedings of the IEEE, pp.82-85, 1998. ,
Montecito: A dual-core, dual-thread Itanium processor, IEEE Micro, issue.2, pp.10-20, 2005. ,
Execution Drafting: Energy Efficiency through Computation Deduplication, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp.432-444, 2014. ,
DOI : 10.1109/MICRO.2014.43
Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads, Parallel Computing, vol.40, issue.9, pp.548-558, 2014. ,
DOI : 10.1016/j.parco.2014.03.006
URL : https://hal.archives-ouvertes.fr/hal-01087054
iGPU, ACM SIGARCH Computer Architecture News, pp.72-83, 2012. ,
DOI : 10.1145/2366231.2337168
An Evaluation of Vectorizing Compilers, 2011 International Conference on Parallel Architectures and Compilation Techniques, pp.372-382, 2011. ,
DOI : 10.1109/PACT.2011.68
Optimizing application performance on x64 processor-based systems with PGI compilers and tools, The Portland Group, 2007. ,
Register renaming and dynamic speculation: an alternative approach, Proceedings of the 26th Annual International Symposium on Microarchitecture, pp.202-213, 1993. ,
DOI : 10.1109/MICRO.1993.282756
Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling, 2009 IEEE International Conference on Computer Design, pp.282-288, 2009. ,
DOI : 10.1109/ICCD.2009.5413143
Dynamic warp subdivision for integrated branch and memory divergence tolerance, ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.235-246, 2010. ,
DOI : 10.1145/1816038.1815992
The OpenCL specification, 2009 IEEE Hot Chips 21 Symposium (HCS), pp.1-314, 2009. ,
DOI : 10.1109/HOTCHIPS.2009.7478342
Autovectorization in GCC, Proceedings of the 2004 GCC Developers Summit, pp.105-118, 2004. ,
Scalable parallel programming with CUDA, Queue, vol.6, issue.2, pp.40-53, 2008. ,
DOI : 10.1145/1365490.1365500
The GPU Computing Era, IEEE Micro, vol.30, issue.2, pp.56-69, 2010. ,
DOI : 10.1109/MM.2010.41
Compute unified device architecture programming guide, 2007. ,
Autovectorization in GCC: two years later, Proceedings of the 2006 GCC Developers Summit, pp.145-158, 2006. ,
Highlights of the high-bandwidth memory (HBM) standard, Memory Forum Workshop, 2014. ,
The case for a single-chip multiprocessor, ACM SIGPLAN Notices, issue.9, pp.312-323, 1996. ,
The design and implementation of a first-generation Cell processor: a multi-core SoC, Integrated Circuit Design and Technology, 2005. ICICDT 2005. 2005 International Conference on, pp.49-52, 2005. ,
big.LITTLE processing with ARM Cortex-A15 & Cortex-A7, 2013. ,
Computer organization and design: the hardware/software interface, 2013. ,
Complexity-effective superscalar processors, 1997. ,
DOI : 10.1145/384286.264201
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.527.5571
ispc: A SPMD compiler for high-performance CPU programming, Innovative Parallel Computing (InPar), 2012, pp.1-13, 2012. ,
Advanced compiler optimizations for supercomputers, Commun. ACM, vol.29, issue.12, pp.1184-1201, 1986. ,
OpenSPARC: An open platform for hardware reliability experimentation, Fourth Workshop on Silicon Errors in Logic-System Effects (SELSE), 2008. ,
Compiling C* programs for a hypercube multicomputer, In ACM SIGPLAN Notices, vol.23, pp.57-65, 1988. ,
Intel® AVX-512 instructions, 2013. ,
Additional Intel® AVX-512 instructions, 2014. ,
Register organization for media processing, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), pp.375-386, 2000. ,
DOI : 10.1109/HPCA.2000.824366
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.7602
The CRAY-1 computer system, Communications of the ACM, vol.21, issue.1, pp.63-72, 1978. ,
DOI : 10.1145/359327.359336
UltraSPARC T2: A highly-threaded, power-efficient, SPARC SoC, Solid-State Circuits Conference ASSCC'07. IEEE Asian, pp.22-25, 2007. ,
Analysis of multithreaded architectures for parallel computing, Proceedings of the second annual ACM symposium on Parallel algorithms and architectures , SPAA '90, pp.169-178, 1990. ,
DOI : 10.1145/97444.97683
A new case for the TAGE branch predictor, Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pp.117-127, 2011. ,
DOI : 10.1145/2155620.2155635
URL : https://hal.archives-ouvertes.fr/hal-00639193
Design tradeoffs for the Alpha EV8 conditional branch predictor, 29th International Symposium on Computer Architecture, pp.25-29, 2002. ,
OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering, vol.12, issue.1-3, pp.66-73, 2010. ,
Implementing streaming SIMD extensions on the Pentium III processor, 2000. ,
Architecture and applications of the HEP multiprocessor computer system, 25th Annual Technical Symposium, pp.241-248, 1982. ,
A mechanism for SIMD execution of SPMD programs, Proceedings High Performance Computing on the Information Superhighway. HPC Asia '97, pp.529-534, 1997. ,
DOI : 10.1109/HPC.1997.592203
The effectiveness of multiple hardware contexts, ACM SIGPLAN Notices, vol.29, issue.11, pp.328-337, 1994. ,
DOI : 10.1145/195470.195583
Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor, Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp.191-202, 1996. ,
Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor, In ACM SIGARCH Computer Architecture News, vol.24, pp.191-202, 1996. ,
Simultaneous multithreading: Maximizing on-chip parallelism, In ACM SIGARCH Computer Architecture News, vol.23, pp.392-403, 1995. ,
An efficient algorithm for exploiting multiple arithmetic units, IBM Journal of Research and Development, vol.11, issue.1, pp.25-33, 1967. ,
Initial observations of the simultaneous multithreading Pentium 4 processor, 12th International Conference on Parallel Architectures and Compilation Techniques (PACT 2003), pp.26-34, 2003. ,
DOI : 10.1109/PACT.2003.1237999
Hyper-threading technology architecture and microarchitecture, Intel Technology Journal Q, vol.1, 2002. ,
Quantifying sources of error in McPAT and potential impacts on architectural studies, 21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, pp.577-589, 2015. ,