Fine-Grained Binary Instrumentation for GPU Profiling
Hsuan-Heng Wu
Abstract
Instrumentation of binary code is a powerful technique that has been applied
in many areas, including collecting fine-grained execution data for
application performance tuning. Our experiences with the Dyninst binary
analysis and instrumentation toolkits have shown that performance data can
be collected efficiently for CPUs, and we are now applying these techniques
to GPUs, specifically to the AMD family of GPUs.
We will first give a brief introduction to the AMD GPU environment. Then we
will discuss our design for efficient GPU kernel instrumentation, including
device-side resource allocation for storing performance data, efficient
transfer of performance data from the GPU to the CPU, and keeping the
instrumentation overhead low compared to other approaches. We use binary
instrumentation to avoid the need to recompile the kernel.
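As a rough illustration of device-side storage and bulk transfer, the sketch below simulates on the CPU one possible layout: a buffer preallocated before kernel launch, with a private counter slot per (wavefront, instrumentation point) pair, reduced on the host after a single copy. The layout and the names make_counter_buffer, record, and aggregate are assumptions made for this example and are not taken from the design described here.

```python
from array import array


def make_counter_buffer(num_wavefronts, num_points):
    """Preallocate one zeroed 64-bit counter per (wavefront, point) pair.

    Hypothetical layout: slot index = wavefront_id * num_points + point_id.
    """
    return array("Q", [0]) * (num_wavefronts * num_points)


def record(buffer, num_points, wavefront_id, point_id, amount=1):
    """Stand-in for an inserted snippet: bump this wavefront's own slot."""
    buffer[wavefront_id * num_points + point_id] += amount


def aggregate(buffer, num_wavefronts, num_points):
    """Host-side reduction after one bulk copy: total per point."""
    totals = [0] * num_points
    for w in range(num_wavefronts):
        base = w * num_points
        for p in range(num_points):
            totals[p] += buffer[base + p]
    return totals
```

Giving each wavefront its own slot means the simulated snippets never contend for a counter, and the host needs only one contiguous copy before reducing. Whether a real implementation makes the same trade of memory for contention is a design choice this sketch does not settle.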
In our early experiments with these techniques, we have collected
branch-divergence information, which can be used as an indicator of GPU
hardware utilization, on several GPU kernels. We have also compared our
instrumentation overhead to that of NVIDIA’s CUPTI when gathering similar
information, and ours is consistently much lower.
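To make the branch-divergence metric concrete, here is a minimal CPU-side sketch of how one branch execution could be classified from two lane bitmasks. It assumes a 64-lane wavefront and treats a branch as divergent when some, but not all, active lanes take it. The function names and the mask representation are hypothetical and are not drawn from the tool described above.

```python
WAVEFRONT_SIZE = 64  # assumed lane count; some AMD GPUs also run 32-wide
FULL_MASK = (1 << WAVEFRONT_SIZE) - 1


def is_divergent(active_mask, taken_mask):
    """True when active lanes disagree on the branch outcome."""
    taken = active_mask & taken_mask  # ignore lanes that are not executing
    return taken != 0 and taken != active_mask


def divergence_rate(samples):
    """Fraction of (active_mask, taken_mask) samples that diverge."""
    samples = list(samples)
    if not samples:
        return 0.0
    divergent = sum(1 for active, taken in samples if is_divergent(active, taken))
    return divergent / len(samples)
```

A rate near zero suggests that lanes mostly follow the same path, while a high rate means many lanes sit idle as each side of a branch runs in turn.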