
GPU Programming Performance Workshop

Problem

  • Compute for vectors A, B, C of size N (float32):
    1. D = A + B
    2. E = D * C + B
    3. result = sum(E)
  • Repeat for --iters iterations. Report time and estimated GB/s.
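
The three steps above can be pinned down with a minimal pure-Python reference (the real variants operate on float32 arrays of size N; this sketch uses small lists, and the function name is illustrative):

```python
def compute(A, B, C, iters=1):
    """Reference implementation of the workshop problem."""
    result = 0.0
    for _ in range(iters):
        D = [a + b for a, b in zip(A, B)]            # step 1: D = A + B
        E = [d * c + b for d, c, b in zip(D, C, B)]  # step 2: E = D * C + B
        result = sum(E)                              # step 3: result = sum(E)
    return result
```

Useful as a ground truth when validating the C++, CUDA, and PyTorch outputs on tiny inputs.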

Directory layout

  • cpp_single/main.cpp
  • cpp_omp/main.cpp
  • cuda/main.cu
  • pytorch/baseline.py
  • pytorch/optimized.py

Prereqs

  • C++17 compiler (g++/clang++)
  • OpenMP (optional for cpp_omp)
  • NVIDIA CUDA toolkit for building cuda/main.cu
  • Python 3.9+ and PyTorch (with CUDA for GPU runs)

Build

  • Single-threaded C++: g++ -O3 -march=native -std=c++17 -DNDEBUG cpp_single/main.cpp -o bin_cpp_single

  • OpenMP C++ (Linux/macOS; clang may need -Xpreprocessor -fopenmp and libomp): g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp

  • CUDA: nvcc -O3 -arch=native cuda/main.cu -o bin_cuda. If your nvcc does not support -arch=native, target your GPU's compute capability explicitly, e.g.: nvcc -O3 -arch=sm_80 cuda/main.cu -o bin_cuda

Run

  • CPU single-thread: ./bin_cpp_single 100000000 10

  • CPU OpenMP (set threads): OMP_NUM_THREADS=8 ./bin_cpp_omp 100000000 10

  • CUDA: ./bin_cuda 100000000 10

  • PyTorch baseline (CPU or GPU): python pytorch/baseline.py --N 100000000 --iters 10 --device cuda, or pass --device cpu for the CPU run

  • PyTorch optimized: python pytorch/optimized.py --N 100000000 --iters 10

Notes

  • Memory: at N=100M, each float32 array is ~400 MB, so A, B, C take ~1.2 GB and D, E another ~800 MB. Ensure enough RAM/GPU memory.
  • If you hit OOM on GPU, reduce N (e.g., 50_000_000).
  • Throughput model assumes 7 floats moved per element per iteration; actual traffic may vary with caching and kernel fusion.
  • For fair GPU timing, we synchronize after each iter.
  • To compare kernel launch overhead, try small N (e.g., 1_000_000) and more iters.
  • To compare bandwidth limits, try large N (e.g., 200_000_000) and fewer iters.
  • PyTorch optimized uses pinned memory, in-place ops, preallocation, and CUDA Graphs.
  • You can profile with:
    • Nsight Systems: nsys profile ./bin_cuda 50000000 50
    • nvprof (legacy): nvprof ./bin_cuda 50000000 50
    • torch profiler: see torch.profiler in docs.
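
The GB/s estimate mentioned above follows directly from the traffic model. A small helper (the function name and defaults are illustrative; 7 floats/element/iter is the model stated in the notes, 4 bytes is the float32 size):

```python
def estimated_gbps(n, iters, seconds, floats_per_elem=7, bytes_per_float=4):
    """Estimated throughput under the workshop's traffic model:
    floats_per_elem floats moved per element per iteration."""
    total_bytes = n * iters * floats_per_elem * bytes_per_float
    return total_bytes / seconds / 1e9

# e.g. N=100M and 10 iters completing in 1 s corresponds to 28 GB/s
# under this model.
```

Wrap your timed loop with a wall-clock timer (and, on GPU, a synchronize after each iteration as noted above) to obtain `seconds`.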

Validation

  • All variants print "result" which should be numerically close across methods (tiny differences expected due to different reduction orders and precision).
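
A quick way to check "numerically close" when comparing two runs (the tolerance value here is a suggested default, not part of the workshop code):

```python
import math

def results_match(a, b, rel_tol=1e-4):
    # Different reduction orders and precisions perturb low-order bits,
    # so compare with a relative tolerance rather than exact equality.
    return math.isclose(a, b, rel_tol=rel_tol)
```

Tighten or loosen rel_tol depending on N and on whether float16 variants are in play.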

Extensions (optional for class)

  • Fuse add+fma into one CUDA kernel to show fewer memory passes.
  • Use thrust or cub for reductions.
  • Try half-precision (float16/bfloat16) on GPU for bandwidth gains.
  • Add vectorized loads (float4) on CPU and CUDA to show further speedups.