sherlock 2025-09-06 09:41:18 +05:30
commit ea7dcba939
6 changed files with 481 additions and 0 deletions

README.md
# GPU Programming Performance Workshop
## Problem
- Compute, for vectors A, B, C of size N (float32):
  1) D = A + B
  2) E = D * C + B
  3) result = sum(E)
- Repeat for `--iters` iterations. Report elapsed time and estimated GB/s.
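For reference, the three steps collapse into one expression. A pure-Python sketch of the computation (stand-in only: the real variants use float32 arrays, so their last bits can differ):

```python
# Pure-Python reference for the workshop computation (small N).
# Plain Python floats are float64; the float32 variants may differ slightly.
def reference_result(A, B, C):
    D = [a + b for a, b in zip(A, B)]            # step 1: D = A + B
    E = [d * c + b for d, c, b in zip(D, C, B)]  # step 2: E = D * C + B
    return sum(E)                                # step 3: reduce

A, B, C = [1.0, 2.0], [3.0, 4.0], [2.0, 2.0]
print(reference_result(A, B, C))  # D = [4, 6], E = [11, 16] -> 27.0
```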
## Directory layout
- `cpp_single/main.cpp`
- `cpp_omp/main.cpp`
- `cuda/main.cu`
- `pytorch/baseline.py`
- `pytorch/optimized.py`
## Prereqs
- C++17 compiler (g++/clang++)
- OpenMP (optional, for `cpp_omp`)
- NVIDIA CUDA Toolkit (to build `cuda/main.cu`)
- Python 3.9+ and PyTorch (with CUDA for GPU runs)
## Build
- Single-threaded C++:
  ```sh
  g++ -O3 -march=native -std=c++17 -DNDEBUG cpp_single/main.cpp -o bin_cpp_single
  ```
- OpenMP C++ (on macOS, clang may need `-Xpreprocessor -fopenmp` and libomp):
  ```sh
  g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp
  ```
- CUDA:
  ```sh
  nvcc -O3 -arch=native cuda/main.cu -o bin_cuda
  ```
  If your nvcc does not support `-arch=native`, name your GPU's compute capability explicitly, e.g.:
  ```sh
  nvcc -O3 -arch=sm_80 cuda/main.cu -o bin_cuda
  ```
## Run
Positional arguments are N and iters.
- CPU single-thread:
  ```sh
  ./bin_cpp_single 100000000 10
  ```
- CPU OpenMP (set the thread count first):
  ```sh
  export OMP_NUM_THREADS=8
  ./bin_cpp_omp 100000000 10
  ```
- CUDA:
  ```sh
  ./bin_cuda 100000000 10
  ```
- PyTorch baseline (CPU or GPU auto-detect):
  ```sh
  python pytorch/baseline.py --N 100000000 --iters 10 --device cuda
  python pytorch/baseline.py --N 100000000 --iters 10 --device cpu
  ```
- PyTorch optimized:
  ```sh
  python pytorch/optimized.py --N 100000000 --iters 10
  ```
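All variants share the same harness shape: time `iters` repetitions, then report the result and an estimated GB/s. A pure-Python sketch of that loop (hypothetical stand-in, not the repo's `baseline.py`; the name `run_iters` is illustrative):

```python
import time

def run_iters(A, B, C, iters):
    """Time `iters` repetitions of the workload; report result, seconds, GB/s.
    Pure-Python stand-in for the harness in the C++/CUDA/PyTorch variants."""
    n = len(A)
    t0 = time.perf_counter()
    result = 0.0
    for _ in range(iters):
        D = [a + b for a, b in zip(A, B)]
        E = [d * c + b for d, c, b in zip(D, C, B)]
        result = sum(E)
    elapsed = time.perf_counter() - t0
    # README's traffic model: 7 float32 values (28 bytes) per element per iter
    gbps = (7 * 4 * n * iters) / elapsed / 1e9
    return result, elapsed, gbps

res, sec, gbps = run_iters([1.0] * 1000, [2.0] * 1000, [0.5] * 1000, 10)
print(f"result={res} time={sec:.4f}s est={gbps:.2f} GB/s")
```

A GPU version would synchronize before reading the clock, as the Notes below describe.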
## Notes
- Memory: at N=100M each float32 array is ~400 MB, so A, B, C (~1.2 GB) plus the temporaries D, E (~0.8 GB) need ~2 GB. Ensure enough RAM/GPU memory.
- If you hit OOM on GPU, reduce N (e.g., 50_000_000).
- The throughput model assumes 7 floats moved per element per iteration; actual traffic may vary.
- For fair GPU timing, we synchronize after each iteration.
- To compare kernel-launch overhead, try small N (e.g., 1_000_000) and more iters.
- To probe bandwidth limits, try large N (e.g., 200_000_000) and fewer iters.
- `pytorch/optimized.py` uses pinned memory, in-place ops, preallocation, and CUDA Graphs.
- You can profile with:
  - Nsight Systems: `nsys profile ./bin_cuda 50000000 50`
  - nvprof (legacy): `nvprof ./bin_cuda 50000000 50`
  - torch profiler: see `torch.profiler` in the PyTorch docs.
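The memory and bandwidth figures above come from straightforward arithmetic, which is worth doing once explicitly:

```python
N = 100_000_000
bytes_per_float = 4  # float32

array_mb = N * bytes_per_float / 1e6               # one array
total_gb = 5 * N * bytes_per_float / 1e9           # A, B, C, D, E
moved_per_iter_gb = 7 * N * bytes_per_float / 1e9  # 7-floats/element model

print(array_mb, total_gb, moved_per_iter_gb)  # 400.0 2.0 2.8
```

So at N=100M each iteration moves ~2.8 GB under the model; dividing that by the measured per-iteration time gives the reported GB/s.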
## Validation
All variants print "result", which should be numerically close across methods (tiny differences are expected due to different reduction orders and precisions).
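"Numerically close" can be checked with a relative tolerance; float32 reductions in different orders typically still agree to ~5–6 significant digits. A hypothetical helper (not part of the repo):

```python
import math

def results_match(a, b, rel_tol=1e-5):
    """True if two printed results agree within float32-reduction slack."""
    return math.isclose(a, b, rel_tol=rel_tol)

print(results_match(1.2345678e8, 1.2345702e8))  # float32-scale drift -> True
print(results_match(1.23e8, 1.30e8))            # real divergence -> False
```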
## Extensions (optional for class)
- Fuse the add and FMA steps into one CUDA kernel to show fewer memory passes.
- Use Thrust or CUB for the reduction.
- Try half precision (float16/bfloat16) on GPU for bandwidth gains.
- Add vectorized loads (float4) on CPU and CUDA to show further speedups.
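Why fusion helps: done in one pass, the computation reads each of A, B, C once and never materializes D or E, so the model's 7 floats per element drop to 3. A pure-Python sketch of the fused form (the actual extension would be a CUDA kernel):

```python
def fused_result(A, B, C):
    """One-pass version: no D/E temporaries touch memory, modeling the
    fused kernel that reads only A, B, C (3 floats/element vs 7)."""
    total = 0.0
    for a, b, c in zip(A, B, C):
        total += (a + b) * c + b  # D and E live only in registers
    return total

print(fused_result([1.0, 2.0], [3.0, 4.0], [2.0, 2.0]))  # 27.0
```

The same idea motivates the half-precision extension: halving bytes per value halves traffic under the same model.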