Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
Test: Run 01_matmul_add.py with size 4096 on GPU and analyse profiler output to identify GPU idle time.
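One way to approach the idle-time analysis is to merge the GPU kernel intervals in the exported Chrome-trace JSON and measure the gaps between them. The following is a minimal sketch, not the article's own method. It assumes that kernel events are complete events (`"ph": "X"`) tagged with `"cat": "kernel"` and that `ts` and `dur` are in microseconds. The helper name `gpu_busy_and_idle_us` and the sample trace are hypothetical; check the category names in your own trace before relying on it.

```python
import json

# Hypothetical, hand-written trace in Chrome-trace layout (times in microseconds).
SAMPLE_TRACE = json.loads("""
{"traceEvents": [
  {"ph": "X", "cat": "cpu_op", "name": "aten::matmul", "ts": 0,   "dur": 300},
  {"ph": "X", "cat": "kernel", "name": "gemm_kernel",  "ts": 100, "dur": 50},
  {"ph": "X", "cat": "kernel", "name": "add_kernel",   "ts": 200, "dur": 25},
  {"ph": "X", "cat": "kernel", "name": "copy_kernel",  "ts": 220, "dur": 30}
]}
""")


def gpu_busy_and_idle_us(trace, kernel_categories=("kernel",)):
    """Return (busy_us, idle_us) between the first kernel start and the last kernel end."""
    # Collect (start, end) for every complete event in an assumed kernel category.
    intervals = sorted(
        (ev["ts"], ev["ts"] + ev["dur"])
        for ev in trace["traceEvents"]
        if ev.get("ph") == "X" and ev.get("cat") in kernel_categories
    )
    if not intervals:
        return 0.0, 0.0

    busy = 0.0
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start > cur_end:
            # Gap found: close the current merged interval and start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent kernels count as one busy stretch.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start

    span = cur_end - intervals[0][0]
    return busy, span - busy


busy_us, idle_us = gpu_busy_and_idle_us(SAMPLE_TRACE)
print(f"GPU busy: {busy_us} us, idle between kernels: {idle_us} us")
```

For a real run, load the exported file with `json.load` and pass the resulting dictionary to the helper in place of `SAMPLE_TRACE`.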
Summary
Profiling in PyTorch (Part 1) walks developers through using `torch.profiler` to uncover bottlenecks in a simple matrix multiplication followed by a bias addition. The example script `01_matmul_add.py` runs on an NVIDIA A100‑SXM4‑80GB GPU. With 64×64 matrices the workload is heavily CPU‑bound: less than 1% of the time is spent in the GPU kernel. Increasing the matrix size to 4096×4096 shifts execution to compute‑bound, with GPU time rising to 4.5 ms and the kernel `ampere_bf16_s16816gemm` dominating the trace. The profiler produces two artifacts: a statistical table via `prof.key_averages().table()` and a Chrome‑trace JSON file via `prof.export_chrome_trace()`, which can be visualised in Perfetto. Developers can wrap code in `torch.profiler.record_function` to label events in the trace, and control which steps are profiled with `torch.profiler.schedule(wait=1, warmup=1, active=3)`. The guide emphasises that larger workloads amortise the CPU‑to‑GPU launch overhead over more GPU work, and that the workload should be run several times before measuring so that the GPU is warmed up. By analysing the trace, developers can identify idle periods, kernel launch latency, and opportunities to batch operations for better GPU utilisation.
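The workflow summarised above might look roughly like the sketch below. It is not the contents of `01_matmul_add.py`, which are not reproduced here. The function name `profile_matmul_add`, the label `"matmul_add"`, the tensor shapes, and the choice of bfloat16 on GPU (inferred from the kernel name) are all assumptions. The sketch falls back to CPU and float32 when no GPU is present.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def profile_matmul_add(size=64, trace_path=None):
    """Profile y = x @ w + b once and return the summary table as a string."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Assumption: bf16 on GPU to match the kernel name; float32 on CPU.
    dtype = torch.bfloat16 if device == "cuda" else torch.float32

    x = torch.randn(size, size, device=device, dtype=dtype)
    w = torch.randn(size, size, device=device, dtype=dtype)
    b = torch.randn(size, device=device, dtype=dtype)

    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    # Warm-up runs outside the profiler so one-time setup costs are not measured.
    for _ in range(3):
        _ = x @ w + b
    if device == "cuda":
        torch.cuda.synchronize()

    with profile(activities=activities) as prof:
        # record_function labels this scope in both the table and the trace.
        with record_function("matmul_add"):
            _ = x @ w + b
        if device == "cuda":
            # Wait for the GPU so the kernels finish inside the profiled region.
            torch.cuda.synchronize()

    if trace_path is not None:
        # Chrome-trace JSON, viewable in Perfetto.
        prof.export_chrome_trace(trace_path)
    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
```

Calling `print(profile_matmul_add(size=4096, trace_path="trace.json"))` would print the table and write a trace file. The absolute timings depend on the hardware, so they will not necessarily match the figures quoted above.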
Key changes
- torch.profiler.profile with activities CPU and CUDA
- record_function annotates scopes
- Export table via prof.key_averages().table
- Export trace via prof.export_chrome_trace
- 64×64 matrix is CPU‑bound, <1% GPU time
- 4096×4096 matrix shifts to compute‑bound, GPU time 4.5 ms
- Perfetto UI visualises trace
- schedule(wait=1,warmup=1,active=3) controls profiling steps
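To make the last point concrete, the sketch below models which phase each training step falls into under `schedule(wait=1, warmup=1, active=3)`. It is a plain-Python illustration of the documented behaviour with the default `repeat=0` and no skipped steps, not PyTorch's implementation. The `Phase` enum and `schedule_phase` function are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    NONE = "profiler idle"
    WARMUP = "tracing on, results discarded"
    RECORD = "recording"
    RECORD_AND_SAVE = "recording, trace handed off after this step"


def schedule_phase(step, wait=1, warmup=1, active=3):
    """Model the phase of a given step; the cycle repeats every wait + warmup + active steps."""
    cycle = wait + warmup + active
    pos = step % cycle
    if pos < wait:
        return Phase.NONE
    if pos < wait + warmup:
        return Phase.WARMUP
    if pos < cycle - 1:
        return Phase.RECORD
    # The last active step of the cycle also triggers saving the trace.
    return Phase.RECORD_AND_SAVE


for step in range(6):
    print(step, schedule_phase(step).name)
```

In a real loop the profiler only advances through these phases when `prof.step()` is called once per iteration, so a loop needs at least five steps to complete one full cycle with these settings.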