Efficient Instrumentation and Tracepoint Insertion for GPU Compute Kernels
Sébastien Darche
Abstract
Reducing instrumentation noise and overhead in software development tools is a
major factor in providing insightful performance reports for programmers. This is particularly
difficult when tracing highly parallel GPU compute kernels, which raises challenges
for instrumentation, data movement, and trace analysis. Furthermore, while complex kernels
usually benefit the most from tracing tools, the instrumentation overhead often strongly
correlates with the kernel complexity. Thus, GPU compute kernel tracing is a prime candidate
for improvement.
We propose a method for efficient tracepoint placement in GPU compute kernels that leverages
properties derived from static analysis of the control flow graph (CFG) at compilation
time. This is enabled by GPUs relying on stack-based SIMT control flow, allowing for
postmortem computation of vector control flow data. Compared to current tracing methods,
our approach can reduce the number of instrumentation points, while guaranteeing the same
level of detail when processing the trace.
We evaluate the reference implementation of our method on a comprehensive scientific
computing benchmark, obtaining an average reduction of 59% in the total number of inserted
tracepoints, which in turn reduces the runtime overhead and total trace size when tracing a program.
The reference implementation is freely available and integrated into a complete GPU tracing
tool, ready for use by programmers.