This commit is contained in:
sherlock 2025-09-06 09:56:32 +05:30
parent ea7dcba939
commit cb240a6d75


@@ -15,7 +15,7 @@ Directory layout
 - pytorch/optimized.py
 Prereqs
-- C++17 compiler (g++/clang++)
+- GCC C++17 compiler (g++)
 - OpenMP (optional for cpp_omp)
 - NVIDIA CUDA toolkit for building cuda/main.cu
 - Python 3.9+ and PyTorch (with CUDA for GPU runs)
@@ -25,7 +25,6 @@ Build
   g++ -O3 -march=native -std=c++17 -DNDEBUG cpp_single/main.cpp -o bin_cpp_single
 - OpenMP C++:
-  Linux/macOS (clang may need -Xpreprocessor -fopenmp and libomp):
   g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp
 - CUDA:
@@ -67,9 +66,3 @@ Notes
 Validation
 - All variants print "result" which should be numerically close across methods
   (tiny differences expected due to different reduction orders and precision).
-Extensions (optional for class)
-- Fuse add+fma into one CUDA kernel to show fewer memory passes.
-- Use thrust or cub for reductions.
-- Try half-precision (float16/bfloat16) on GPU for bandwidth gains.
-- Add vectorized loads (float4) on CPU and CUDA to show further speedups.