init

2025-09-06 10:19:26 +05:30 · 2025-09-06 10:19:26 +05:30 · baa9bc407b
commit baa9bc407b
parent cb240a6d75
1 changed file with 9 additions and 4 deletions
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Prereqs
 - GCC C++17 compiler (g++)
 - OpenMP (optional for cpp_omp)
 - NVIDIA CUDA toolkit for building cuda/main.cu
-- Python 3.9+ and PyTorch (with CUDA for GPU runs)
+- uv (Astral's Python package manager) and PyTorch (with CUDA for GPU runs)

 Build
 - Single-threaded C++:
@@ -27,6 +27,11 @@ Build
 - OpenMP C++:
  g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp

+  Note: If using clang++ instead of g++, OpenMP support may require additional setup:
+  - On macOS: brew install libomp, then use: clang++ -Xpreprocessor -fopenmp -lomp ...
+  - On Linux: install libomp-dev, then use: clang++ -fopenmp ...
+  - Or stick with g++ which has built-in OpenMP support
+
 - CUDA:
  nvcc -O3 -arch=native cuda/main.cu -o bin_cuda
  If -arch=native not supported, use e.g.:
@@ -44,11 +49,11 @@ Run
  ./bin_cuda 100000000 10

 - PyTorch baseline (CPU or GPU auto-detect):
-  python pytorch/baseline.py --N 100000000 --iters 10 --device cuda
-  python pytorch/baseline.py --N 100000000 --iters 10 --device cpu
+  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cuda
+  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cpu

 - PyTorch optimized:
-  python pytorch/optimized.py --N 100000000 --iters 10
+  uv run pytorch/optimized.py --N 100000000 --iters 10

 Notes
 - Memory: N=100M uses ~400 MB for A,B,C and ~400 MB for D,E. Ensure enough RAM/GPU memory.
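
The memory note in the last hunk can be sanity-checked with a quick calculation. The sketch below assumes the arrays are float32 (4 bytes per element), which the diff does not state; under that assumption each N=100M array is about 400 MB, so the figures in the note are most plausibly per-array sizes. The helper name `array_megabytes` is hypothetical and not part of the repository.

```python
# Rough memory estimate for N-element arrays.
# Assumption: float32 elements (4 bytes each); pass 8 for float64.
BYTES_PER_ELEMENT = 4


def array_megabytes(n, bytes_per_element=BYTES_PER_ELEMENT):
    """Return the size of one n-element array in MB (10**6 bytes)."""
    return n * bytes_per_element / 1e6


if __name__ == "__main__":
    N = 100_000_000
    per_array = array_megabytes(N)
    print(f"one array: {per_array:.0f} MB")
    # If all five arrays (A..E) are resident at once, the total is 5x one array.
    print(f"five arrays: {5 * per_array:.0f} MB")
```

If the benchmark keeps all five arrays allocated at the same time, the total footprint would be roughly five times the per-array figure, so it may be worth checking free RAM or GPU memory against that number rather than a single 400 MB.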