init
parent ea7dcba939
commit cb240a6d75
1 changed file with 1 addition and 8 deletions
@@ -15,7 +15,7 @@ Directory layout
 - pytorch/optimized.py
 
 Prereqs
-- C++17 compiler (g++/clang++)
+- GCC C++17 compiler (g++)
 - OpenMP (optional for cpp_omp)
 - NVIDIA CUDA toolkit for building cuda/main.cu
 - Python 3.9+ and PyTorch (with CUDA for GPU runs)
@@ -25,7 +25,6 @@ Build
 g++ -O3 -march=native -std=c++17 -DNDEBUG cpp_single/main.cpp -o bin_cpp_single
 
 - OpenMP C++:
-Linux/macOS (clang may need -Xpreprocessor -fopenmp and libomp):
 g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp
 
 - CUDA:
@@ -67,9 +66,3 @@ Notes
 Validation
 - All variants print "result" which should be numerically close across methods
 (tiny differences expected due to different reduction orders and precision).
-
-Extensions (optional for class)
-- Fuse add+fma into one CUDA kernel to show fewer memory passes.
-- Use thrust or cub for reductions.
-- Try half-precision (float16/bfloat16) on GPU for bandwidth gains.
-- Add vectorized loads (float4) on CPU and CUDA to show further speedups.
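Note on the retained Validation text: it says the printed "result" values should be numerically close across methods, because different reduction orders round differently. The following is a minimal Python sketch of that idea and is not part of this commit. The function names, the input data, and the 1e-6 relative tolerance are all assumptions chosen for illustration; the repository's own programs may sum different data in a different precision.

```python
import math
import random


def sum_forward(values):
    """Left-to-right accumulation, as a single-threaded loop would do."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Tree-shaped accumulation, similar in spirit to a parallel reduction."""
    if len(values) <= 2:
        return sum(values)
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


def results_agree(a, b, rel_tol=1e-6):
    """Hypothetical check: treat two results as matching within rel_tol."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)


random.seed(0)  # fixed seed so repeated runs use the same data
data = [random.uniform(0.0, 1.0) for _ in range(100_000)]

serial = sum_forward(data)
tree = sum_pairwise(data)

# The two totals may differ in the last few bits, but should pass the check.
print(serial, tree, results_agree(serial, tree))
```

Comparing with a relative tolerance rather than `==` is the usual way to validate variants whose summation order differs; the right tolerance depends on the data size and on whether the variant runs in float32 or float64.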