diff --git a/README.md b/README.md
index 7316770..8c14093 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ Directory layout
 - pytorch/optimized.py
 
 Prereqs
-- C++17 compiler (g++/clang++)
+- GCC C++17 compiler (g++)
 - OpenMP (optional for cpp_omp)
 - NVIDIA CUDA toolkit for building cuda/main.cu
 - Python 3.9+ and PyTorch (with CUDA for GPU runs)
@@ -25,7 +25,6 @@ Build
   g++ -O3 -march=native -std=c++17 -DNDEBUG cpp_single/main.cpp -o bin_cpp_single
 
 - OpenMP C++:
-  Linux/macOS (clang may need -Xpreprocessor -fopenmp and libomp):
   g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp
 
 - CUDA:
@@ -67,9 +66,3 @@ Notes
 
 Validation
 - All variants print "result" which should be numerically close across methods (tiny differences expected due to different reduction orders and precision).
-
-Extensions (optional for class)
-- Fuse add+fma into one CUDA kernel to show fewer memory passes.
-- Use thrust or cub for reductions.
-- Try half-precision (float16/bfloat16) on GPU for bandwidth gains.
-- Add vectorized loads (float4) on CPU and CUDA to show further speedups.