sherlock 2025-09-06 10:19:26 +05:30
parent cb240a6d75
commit baa9bc407b

@@ -18,7 +18,7 @@ Prereqs
 - GCC C++17 compiler (g++)
 - OpenMP (optional for cpp_omp)
 - NVIDIA CUDA toolkit for building cuda/main.cu
-- Python 3.9+ and PyTorch (with CUDA for GPU runs)
+- uv (Astral's Python package manager) and PyTorch (with CUDA for GPU runs)

 Build
 - Single-threaded C++:
@@ -27,6 +27,11 @@ Build
 - OpenMP C++:
   g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp
+  Note: If using clang++ instead of g++, OpenMP support may require additional setup:
+  - On macOS: brew install libomp, then use: clang++ -Xpreprocessor -fopenmp -lomp ...
+  - On Linux: install libomp-dev, then use: clang++ -fopenmp ...
+  - Or stick with g++ which has built-in OpenMP support
 - CUDA:
   nvcc -O3 -arch=native cuda/main.cu -o bin_cuda
   If -arch=native not supported, use e.g.:
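The -arch value has to match the GPU's compute capability when -arch=native is unavailable (it requires a fairly recent nvcc). Since PyTorch is already installed for the baseline, one quick way to look up the right sm_XY value is a one-off script like the sketch below; this is a hypothetical helper, not a script shipped in this repo.

    # Hypothetical helper (not in this repo): report the first CUDA device's
    # compute capability so it can be passed to nvcc as -arch=sm_XY.
    import torch

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"nvcc -O3 -arch=sm_{major}{minor} cuda/main.cu -o bin_cuda")
    else:
        print("No CUDA device visible to PyTorch; check the driver and toolkit install.")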
@@ -44,11 +49,11 @@ Run
   ./bin_cuda 100000000 10
 - PyTorch baseline (CPU or GPU auto-detect):
-  python pytorch/baseline.py --N 100000000 --iters 10 --device cuda
-  python pytorch/baseline.py --N 100000000 --iters 10 --device cpu
+  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cuda
+  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cpu
 - PyTorch optimized:
-  python pytorch/optimized.py --N 100000000 --iters 10
+  uv run pytorch/optimized.py --N 100000000 --iters 10

 Notes
 - Memory: N=100M uses ~400 MB for A,B,C and ~400 MB for D,E. Ensure enough RAM/GPU memory.
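On the memory note: a single float32 array of length N takes N * 4 bytes, so roughly 400 MB at N = 100M; budget for every array the kernels allocate plus framework overhead. The sketch below is a hypothetical sizing check (not one of the repo's scripts, and it assumes float32 arrays of length N) that prints the per-array estimate and, when a GPU is visible, the free device memory reported by PyTorch.

    # Hypothetical sizing check (not in this repo): per-array float32 footprint
    # at a given N, plus free/total GPU memory if a CUDA device is available.
    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--N", type=int, default=100_000_000)
    args = parser.parse_args()

    mb_per_array = args.N * 4 / 1e6  # float32 = 4 bytes per element
    print(f"One float32 array of length {args.N}: ~{mb_per_array:.0f} MB")

    if torch.cuda.is_available():
        free_b, total_b = torch.cuda.mem_get_info()
        print(f"GPU memory free/total: {free_b / 1e9:.2f} / {total_b / 1e9:.2f} GB")
    else:
        print("No CUDA device visible; CPU runs draw from host RAM instead.")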