GPU Programming Performance Workshop

Problem
- Compute, for vectors A, B, C of size N (float32):
  1) D = A + B
  2) E = D * C + B
  3) result = sum(E)
- Repeat for --iters iterations. Report elapsed time and estimated GB/s. (A minimal sketch of this workload appears under Reference sketches at the end of this file.)

Directory layout
- cpp_single/main.cpp
- cpp_omp/main.cpp
- cuda/main.cu
- pytorch/baseline.py
- pytorch/optimized.py

Prereqs
- GCC C++17 compiler (g++)
- OpenMP (needed only for cpp_omp)
- NVIDIA CUDA toolkit (for building cuda/main.cu)
- uv (Astral's Python package manager) and PyTorch (with CUDA for GPU runs)

Build
- Single-threaded C++:
  g++ -O3 -march=native -std=c++17 -DNDEBUG cpp_single/main.cpp -o bin_cpp_single
- OpenMP C++:
  g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp
  Note: if you use clang++ instead of g++, OpenMP may require extra setup:
  - macOS: brew install libomp, then: clang++ -Xpreprocessor -fopenmp -lomp ...
  - Linux: install libomp-dev, then: clang++ -fopenmp ...
  - Or stick with g++, which ships with OpenMP support.
- CUDA:
  nvcc -O3 -arch=native cuda/main.cu -o bin_cuda
  If your nvcc does not support -arch=native, target your GPU's compute capability explicitly, e.g.:
  nvcc -O3 -arch=sm_80 cuda/main.cu -o bin_cuda

Run
- CPU, single thread:
  ./bin_cpp_single 100000000 10
- CPU, OpenMP (set the thread count first):
  export OMP_NUM_THREADS=8
  ./bin_cpp_omp 100000000 10
- CUDA:
  ./bin_cuda 100000000 10
- PyTorch baseline (pick CPU or GPU with --device):
  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cuda
  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cpu
- PyTorch optimized:
  uv run pytorch/optimized.py --N 100000000 --iters 10

Notes
- Memory: at N=100M, each float32 vector is ~400 MB, so A, B, C take ~1.2 GB and D, E another ~0.8 GB (~2 GB total). Ensure enough RAM/GPU memory.
- If you hit OOM on the GPU, reduce N (e.g., 50_000_000).
- The throughput model assumes 7 floats moved per element per iteration; actual traffic may vary. (A worked estimate appears below.)
- For fair GPU timing, we synchronize after each iteration. (See the timing sketch below.)
- To compare kernel launch overhead, try a small N (e.g., 1_000_000) and more iterations.
- To probe bandwidth limits, try a large N (e.g., 200_000_000) and fewer iterations.
- pytorch/optimized.py uses pinned memory, in-place ops, preallocation, and CUDA Graphs. (Sketched below.)
- You can profile with:
  - Nsight Systems: nsys profile ./bin_cuda 50000000 50
  - nvprof (legacy): nvprof ./bin_cuda 50000000 50
  - torch profiler: see torch.profiler in the PyTorch docs. (A minimal example appears below.)

Validation
- All variants print "result", which should be numerically close across methods; tiny differences are expected because reduction order and precision differ. (A closeness check appears below.)
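Reference sketches
The snippets below are illustrative Python/PyTorch sketches of the ideas above, not the repo's actual sources; names like workload, step, and estimate_gbps are hypothetical.

One iteration of the workload, exactly as the Problem section defines it:

  import torch

  def workload(A, B, C):
      D = A + B         # step 1: elementwise add
      E = D * C + B     # step 2: elementwise multiply-add
      return E.sum()    # step 3: full reduction to a scalar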
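How the GB/s estimate follows from the 7-floats-per-element model in the Notes (estimate_gbps is a hypothetical helper, not part of the repo):

  def estimate_gbps(N, iters, elapsed_s, floats_per_elem=7):
      # Bytes moved per iteration under the model: N elements * floats * 4 bytes (float32).
      bytes_per_iter = N * floats_per_elem * 4
      return bytes_per_iter * iters / elapsed_s / 1e9

  # e.g. N=100M, 10 iters in 2.0 s: 100e6 * 7 * 4 * 10 / 2.0 / 1e9 = 14.0 GB/s
  print(estimate_gbps(100_000_000, 10, 2.0))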
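Per-iteration timing with a synchronize after each iteration, as the Notes describe. A sketch assuming A, B, C are already CUDA tensors; timed_loop is hypothetical:

  import time
  import torch

  def timed_loop(A, B, C, iters):
      torch.cuda.synchronize()          # drain pending work before starting the clock
      t0 = time.perf_counter()
      for _ in range(iters):
          E = (A + B) * C + B           # D and E fused into one expression here
          result = E.sum()
          torch.cuda.synchronize()      # make sure this iteration's kernels finished
      elapsed = time.perf_counter() - t0
      return result.item(), elapsed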
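pytorch/optimized.py is not reproduced here; the following is a sketch of the four techniques the Notes list (pinned host memory, preallocation, in-place ops, CUDA Graphs), assuming a CUDA device is available:

  import torch

  N, iters = 100_000_000, 10
  dev = torch.device("cuda")

  # Pinned (page-locked) host memory allows fast, async host-to-device copies.
  A_host = torch.rand(N, pin_memory=True)
  A = A_host.to(dev, non_blocking=True)
  B = torch.rand(N, device=dev)
  C = torch.rand(N, device=dev)

  # Preallocate outputs once; the in-place ops below reuse them every iteration.
  D = torch.empty(N, device=dev)
  E = torch.empty(N, device=dev)
  result = torch.empty((), device=dev)

  def step():
      torch.add(A, B, out=D)            # D = A + B, no new allocation
      torch.mul(D, C, out=E)            # E = D * C
      E.add_(B)                         # E += B, in place
      torch.sum(E, dim=0, out=result)   # scalar reduction into a fixed buffer

  # Capture one step in a CUDA Graph, then replay it to amortize launch overhead.
  g = torch.cuda.CUDAGraph()
  s = torch.cuda.Stream()
  s.wait_stream(torch.cuda.current_stream())
  with torch.cuda.stream(s):
      for _ in range(3):                # warmup iterations before capture
          step()
  torch.cuda.current_stream().wait_stream(s)
  with torch.cuda.graph(g):
      step()

  for _ in range(iters):
      g.replay()
  torch.cuda.synchronize()
  print(result.item())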
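A minimal torch.profiler example, wrapping the hypothetical step() from the sketch above:

  from torch.profiler import profile, ProfilerActivity

  with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
      for _ in range(10):
          step()
  print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))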
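One way to make "numerically close" concrete when comparing two runs; the tolerance is an assumption, and you may need to widen it for larger N since float32 reductions lose precision:

  import math

  # result_cpu and result_gpu are the printed scalars from two variants.
  assert math.isclose(result_cpu, result_gpu, rel_tol=1e-4)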