GPU Programming Performance Workshop

Problem
- Compute for vectors A, B, C of size N (float32):
  1) D = A + B
  2) E = D * C + B
  3) result = sum(E)
- Repeat for --iters iterations. Report time and estimated GB/s.
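
For reference, a single iteration of the above can be sketched in a few lines of PyTorch (illustrative only; the actual scripts in pytorch/ may be structured differently):

```python
import torch

def one_iteration(A, B, C):
    D = A + B        # 1) elementwise add
    E = D * C + B    # 2) elementwise multiply-add
    return E.sum()   # 3) reduction to a single scalar

# Quick sanity check with a small N (not a benchmark-sized run).
N = 1_000_000
A, B, C = (torch.rand(N, dtype=torch.float32) for _ in range(3))
print(one_iteration(A, B, C).item())
```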

Directory layout
- cpp_single/main.cpp
- cpp_omp/main.cpp
- cuda/main.cu
- pytorch/baseline.py
- pytorch/optimized.py

Prereqs
- GCC C++17 compiler (g++)
- OpenMP (optional, for cpp_omp)
- NVIDIA CUDA Toolkit (for building cuda/main.cu)
- uv (Astral's Python package manager) and PyTorch (with CUDA for GPU runs)

Build
- Single-threaded C++:

      g++ -O3 -march=native -std=c++17 -DNDEBUG cpp_single/main.cpp -o bin_cpp_single

- OpenMP C++:

      g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp

  Note: if you use clang++ instead of g++, OpenMP support may need extra setup:
  - On macOS: brew install libomp, then use: clang++ -Xpreprocessor -fopenmp -lomp ...
  - On Linux: install libomp-dev, then use: clang++ -fopenmp ...
  - Or stick with g++, which has built-in OpenMP support.

- CUDA:

      nvcc -O3 -arch=native cuda/main.cu -o bin_cuda

  If -arch=native is not supported by your nvcc, pass an explicit architecture, e.g.:

      nvcc -O3 -arch=sm_80 cuda/main.cu -o bin_cuda

Run
- CPU single-thread:

      ./bin_cpp_single 100000000 10

- CPU OpenMP (set the thread count first):

      export OMP_NUM_THREADS=8
      ./bin_cpp_omp 100000000 10

- CUDA:

      ./bin_cuda 100000000 10

- PyTorch baseline (CPU or GPU auto-detect):

      uv run pytorch/baseline.py --N 100000000 --iters 10 --device cuda
      uv run pytorch/baseline.py --N 100000000 --iters 10 --device cpu

- PyTorch optimized:

      uv run pytorch/optimized.py --N 100000000 --iters 10

Notes
- Memory: at N = 100M each float32 vector is ~400 MB, so A, B, C plus the intermediates D and E need roughly 2 GB. Make sure you have enough RAM / GPU memory.
- If you hit an out-of-memory error on the GPU, reduce N (e.g., 50_000_000).
- The throughput model assumes 7 floats moved per element per iteration; actual traffic may vary (see the timing sketch after this list).
- For fair GPU timing, we synchronize after each iteration (also shown in the sketch below).
- To compare kernel launch overhead, try a small N (e.g., 1_000_000) and more iterations.
- To compare bandwidth limits, try a large N (e.g., 200_000_000) and fewer iterations.
- PyTorch optimized uses pinned memory, in-place ops, preallocation, and CUDA Graphs (a rough sketch follows below).
- You can profile with:
  - Nsight Systems: nsys profile ./bin_cuda 50000000 50
  - nvprof (legacy): nvprof ./bin_cuda 50000000 50
  - torch profiler: see torch.profiler in the docs (minimal example below).
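
To make the throughput and timing bullets above concrete, here is a rough sketch of a timed loop that synchronizes each iteration and applies the 7-floats-per-element model (names and structure are illustrative, not taken from the workshop scripts):

```python
import time
import torch

def run_benchmark(A, B, C, iters):
    on_gpu = A.is_cuda
    t0 = time.perf_counter()
    for _ in range(iters):
        D = A + B
        E = D * C + B
        result = E.sum()
        if on_gpu:
            torch.cuda.synchronize()  # make sure GPU work is done before reading the clock
    elapsed = time.perf_counter() - t0
    # Throughput model: 7 float32 values (4 bytes each) moved per element per iteration.
    gbs = 7 * 4 * A.numel() * iters / elapsed / 1e9
    return result.item(), elapsed, gbs
```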
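
A rough sketch of what "pinned memory, in-place ops, preallocation, and CUDA Graphs" can look like in PyTorch is below; this is an assumption-laden illustration, not the contents of pytorch/optimized.py:

```python
import torch

assert torch.cuda.is_available()
dev = torch.device("cuda")
N = 10_000_000

# Pinned (page-locked) host memory makes host-to-device copies faster / async.
A_host = torch.rand(N, dtype=torch.float32).pin_memory()
A = A_host.to(dev, non_blocking=True)
B = torch.rand(N, device=dev)
C = torch.rand(N, device=dev)

# Preallocate outputs once; the captured graph replays onto these fixed buffers.
D = torch.empty_like(A)
E = torch.empty_like(A)
result = torch.empty((), device=dev)

def step():
    torch.add(A, B, out=D)       # D = A + B, written in place into D
    torch.mul(D, C, out=E)
    E.add_(B)                    # E = D * C + B
    torch.sum(E, dim=0, out=result)

# Warm up on a side stream, then capture the sequence into a CUDA Graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    step()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    step()

g.replay()                       # replays the captured kernels with minimal launch overhead
torch.cuda.synchronize()
print(result.item())
```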
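
For the torch.profiler route, a minimal example might look like this (illustrative; see the torch.profiler docs for trace export and scheduling options):

```python
import torch
from torch.profiler import profile, ProfilerActivity

dev = "cuda" if torch.cuda.is_available() else "cpu"
N = 10_000_000
A, B, C = (torch.rand(N, device=dev) for _ in range(3))

activities = [ProfilerActivity.CPU]
if dev == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(10):
        E = (A + B) * C + B
        result = E.sum()
    if dev == "cuda":
        torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```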

Validation
- All variants print "result", which should be numerically close across methods (tiny differences are expected due to different reduction orders and precision).
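
As an illustration of why the results only match approximately, compare a float32 sum against a float64 reference over the same data:

```python
import torch

x = torch.rand(10_000_000, dtype=torch.float32)
s32 = x.sum().item()            # float32 accumulation, order-dependent
s64 = x.double().sum().item()   # float64 reference
print(f"relative difference: {abs(s32 - s64) / abs(s64):.2e}")  # tiny, but usually not zero
```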