Recognition, mining and synthesis moves computers to the era of tera, Technology@ Intel Magazine, vol.9, issue.2, pp.1-10, 2005. ,
Non-intrusive performance profiling for entire software stacks based on the flow reconstruction principle, Proceedings of the conference on Operating Systems Design and Implementation, OSDI'16, pp.603-618, 2016. ,
Structured comparative analysis of systems logs to diagnose performance problems, Proceedings of the conference on Networked Systems Design and Implementation, NSDI'12, pp.26-26, 2012. ,
Analyzing lock contention in multithreaded applications, Proceedings of the symposium on Principles and Practices of Parallel Programming, PPoPP'10, pp.269-280, 2010. ,
Comprehending performance from real-world execution traces, Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, ASPLOS '14, pp.193-206, 2014. ,
DOI : 10.1145/2541940.2541968
Continuously measuring critical section pressure with the free-lunch profiler, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications , OOPSLA'14, pp.291-307, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01080277
Performance analysis of idle programs, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'10, pp.739-753, 2010. ,
Cmp $ im: A pinbased on-the-fly multi-core cache simulator, Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, pp.28-36, 2008. ,
Assessing cache false sharing effects by dynamic binary instrumentation, Proceedings of the Workshop on Binary Instrumentation and Applications, pp.26-33, 2009. ,
Dynamic cache contention detection in multi-threaded applications, Proceedings of the international conference on Virtual Execution Environments, pp.27-38, 2011. ,
DOI : 10.1145/1952682.1952688
URL : http://www.cag.lcs.mit.edu/commit/papers/2011/zhao-vee11-cache-contention.pdf
Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior, Proceedings of the International Symposium on Computer Architecture , ISCA'13, pp.511-522, 2013. ,
Trace-based automatic padding for locality improvement with correlative data visualization interface, Proceedings of the International Conference on Parallel Architectures and Compilation , PACT'07, 2007. ,
DOI : 10.1109/ipdps.2008.4536472
PREDATOR: Predictive false sharing detection, Proceedings of the symposium on Principles and Practices of Parallel Programming, PPoPP'14, pp.3-14, 2014. ,
Cheetah: detecting false sharing efficiently and effectively, Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, pp.1-11, 2016. ,
DOI : 10.1145/1952682.1952688
URL : http://dl.acm.org/ft_gateway.cfm?id=2854039&type=pdf
There goes the neighborhood, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '13, pp.1-12, 2013. ,
DOI : 10.1145/2503210.2503247
Active measurement of the impact of network switch utilization on application performance, Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS'14, pp.165-174, 2014. ,
Linux profiling with performance counters ,
Statistical debugging for real-world performance problems, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'14, pp.561-578, 2014. ,
DOI : 10.1145/2714064.2660234
URL : https://minds.wisconsin.edu/bitstream/handle/1793/68592/TR1803-1.pdf?sequence=3
The SPLASH-2 programs: Characterization and methodological considerations, Proceedings of the International Symposium on Computer Architecture , ISCA'95, pp.24-36, 1995. ,
Evaluating MapReduce for Multi-core and Multiprocessor Systems, 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pp.13-24, 2007. ,
DOI : 10.1109/HPCA.2007.346181
URL : http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf
The PARSEC benchmark suite, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pp.72-81, 2008. ,
DOI : 10.1145/1454115.1454128
Distributed caching with memcached. Linux journal, p.5, 2004. ,
URL: http://leveldb.org, 2011. ,
hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.180-186, 2010. ,
DOI : 10.1109/PDP.2010.67
URL : https://hal.archives-ouvertes.fr/inria-00429889
Memory affinity for hierarchical shared memory multiprocessors, Computer Architecture and High Performance Computing SBAC-PAD'09. 21st International Symposium on, pp.59-66, 2009. ,
URL : https://hal.archives-ouvertes.fr/hal-00788914
A case for NUMA-aware contention management on multicore systems, Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT '10, pp.557-558, 2010. ,
DOI : 10.1145/1854273.1854350
False sharing and its effect on shared memory performance, Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS), p.57, 1993. ,
Scheduling Dynamic OpenMP Applications over Multicore Architectures, OpenMP in a New Era of Parallelism, pp.170-180, 2008. ,
DOI : 10.1007/978-3-540-79561-2_15
URL : https://hal.archives-ouvertes.fr/inria-00329934
Traffic management: A holistic approach to memory placement on numa systems, Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'13, pp.381-394, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00945758
Scaling parallel I/O performance through I/O delegate and caching system, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, p.9, 2008. ,
DOI : 10.1109/SC.2008.5214358
URL : http://users.eecs.northwestern.edu/~choudhar/Publications/ScalingParallelIOPerformanceThroughIODelgateAndCachingSystem.pdf
Everest: Scaling down peak loads through i/o off-loading, OSDI, pp.15-28, 2008. ,
The case for RAMClouds, ACM SIGOPS Operating Systems Review, vol.43, issue.4, pp.92-105, 2010. ,
DOI : 10.1145/1713254.1713276
The Design and Implementation of FFTW3, Proceedings of the IEEE, pp.216-231, 2005. ,
DOI : 10.1109/JPROC.2004.840301
Mcb: A continuous energy monte carlo burnup simulation code. Actinide and fission product partitioning and transmutation, 1999. ,
hypre: A library of high performance preconditioners, International Conference on Computational Science, pp.632-641, 2002. ,
Scaling tests of the improved Kogut-Susskind quark action, Physical Review D, vol.131, issue.11, p.61111502, 2000. ,
DOI : 10.1143/PTPS.131.573
Order-n multiple scattering approach to electronic structure calculations. Physical review letters, p.752867, 1995. ,
DOI : 10.1103/physrevlett.75.2867
Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications, Proceedings of the Usenix Annual Technical Conference, USENIX ATC'12, pp.65-76, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00779908
Fast and portable locking for multicore architectures, ACM Transactions on Computer Systems (TOCS), vol.3313, issue.4, pp.1-1362, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01252167
Transactional memory: Architectural support for lock-free data structures, 1993. ,
DOI : 10.1109/isca.1993.698569
URL : http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-895-theory-of-parallel-systems-sma-5509-fall-2003/readings/herlihy_mo93.pdf
CacheIn: A Toolset for Comprehensive Cache Inspection, Proceedings of the International Conference on Computational Science, ICCS'05, pp.174-181, 2005. ,
DOI : 10.1007/11428848_22
SHERIFF: Precise detection and automatic mitigation of false sharing, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, pp.3-18, 2011. ,
Whose cache line is it anyway?, Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pp.141-154, 2013. ,
DOI : 10.1145/2465351.2465366
Experiences and lessons learned with a portable interface to hardware performance counters, Proceedings International Parallel and Distributed Processing Symposium, pp.289-291, 2003. ,
DOI : 10.1109/IPDPS.2003.1213517
Detection of false sharing using machine learning, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp.1-9, 2013. ,
Locating cache performance bottlenecks using data profiling, Proceedings of the 5th European conference on Computer systems, EuroSys '10, pp.335-348, 2010. ,
DOI : 10.1145/1755913.1755947
URL : http://pdos.csail.mit.edu/papers/dprof:eurosys10.pdf
Contention aware execution, Proceedings of the 8th annual IEEE/ ACM international symposium on Code generation and optimization, CGO '10, pp.257-265, 2010. ,
DOI : 10.1145/1772954.1772991
Cache contention and application performance prediction for multi-core systems, Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS'10, pp.76-86, 2010. ,
Application-specific quantum for multi-core platform scheduler, Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pp.1-3, 2016. ,
DOI : 10.1109/PCCC.2012.6407650
URL : https://hal.archives-ouvertes.fr/hal-01782587
Scarlett, Proceedings of the sixth conference on Computer systems, EuroSys '11, pp.287-300, 2011. ,
DOI : 10.1145/1966445.1966472
Caching less for better performance: balancing cache size and update cost of flash memory cache in hybrid storage systems, FAST, 2012. ,
Memprof: A memory profiler for numa multicore systems, Proceedings of the Usenix Annual Technical Conference, USENIX ATC'12, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00945731
Operating system profiling via latency analysis, Proceedings of the conference on Operating Systems Design and Implementation, OSDI'06, pp.89-102, 2006. ,
DOI : 10.1145/1095810.1118607
Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications, 2012 IEEE International Symposium on Performance Analysis of Systems & Software, pp.145-155, 2012. ,
DOI : 10.1109/ISPASS.2012.6189221
URL : http://users.elis.ugent.be/~leeckhou/papers/ispass12_2.pdf
EZTrace: A Generic Framework for Performance Analysis, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.618-619, 2011. ,
DOI : 10.1109/CCGrid.2011.83
URL : https://hal.archives-ouvertes.fr/inria-00587216
Visual trace explorer (vite), 2009. ,
SimGrid: A Generic Framework for Large-Scale Distributed Experiments, Tenth International Conference on Computer Modeling and Simulation (uksim 2008), pp.126-131, 2008. ,
DOI : 10.1109/UKSIM.2008.28
URL : https://hal.archives-ouvertes.fr/inria-00260697
Analysis of PARSEC workload scalability, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.133-142, 2016. ,
DOI : 10.1109/ISPASS.2016.7482081
Deconstructing the overhead in parallel applications, 2012 IEEE International Symposium on Workload Characterization (IISWC), pp.59-68, 2012. ,
DOI : 10.1109/IISWC.2012.6402901
Benchmarking memory performance with the data cube operator, 2004. ,
memaslap: Load testing and benchmarking a server ,
An interface to implement NUMA policies in the xen hypervisor, Proceedings of the EuroSys European Conference on Computer Systems, EuroSys'17, p.14, 2017. ,