From baa9bc407bab636f2f60df7207b64a68b3cdf9e3 Mon Sep 17 00:00:00 2001
From: sherlock
Date: Sat, 6 Sep 2025 10:19:26 +0530
Subject: [PATCH] init

---
 README.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 8c14093..804da42 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Prereqs
 - GCC C++17 compiler (g++)
 - OpenMP (optional for cpp_omp)
 - NVIDIA CUDA toolkit for building cuda/main.cu
-- Python 3.9+ and PyTorch (with CUDA for GPU runs)
+- uv (Astral's Python package manager) and PyTorch (with CUDA for GPU runs)
 
 Build
 - Single-threaded C++:
@@ -27,6 +27,11 @@ Build
 - OpenMP C++:
   g++ -O3 -march=native -std=c++17 -fopenmp -DNDEBUG cpp_omp/main.cpp -o bin_cpp_omp
 
+  Note: If using clang++ instead of g++, OpenMP support may require additional setup:
+  - On macOS: brew install libomp, then use: clang++ -Xpreprocessor -fopenmp -lomp ...
+  - On Linux: install libomp-dev, then use: clang++ -fopenmp ...
+  - Or stick with g++ which has built-in OpenMP support
+
 - CUDA:
   nvcc -O3 -arch=native cuda/main.cu -o bin_cuda
   If -arch=native not supported, use e.g.:
@@ -44,11 +49,11 @@ Run
   ./bin_cuda 100000000 10
 
 - PyTorch baseline (CPU or GPU auto-detect):
-  python pytorch/baseline.py --N 100000000 --iters 10 --device cuda
-  python pytorch/baseline.py --N 100000000 --iters 10 --device cpu
+  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cuda
+  uv run pytorch/baseline.py --N 100000000 --iters 10 --device cpu
 
 - PyTorch optimized:
-  python pytorch/optimized.py --N 100000000 --iters 10
+  uv run pytorch/optimized.py --N 100000000 --iters 10
 
 Notes
 - Memory: N=100M uses ~400 MB for A,B,C and ~400 MB for D,E. Ensure enough RAM/GPU memory.
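
The uv run commands introduced above assume uv can resolve PyTorch for the scripts. A minimal sketch of one way this could work, assuming pytorch/baseline.py is a standalone script rather than part of a uv project, is to declare the dependency with PEP 723 inline metadata at the top of the file (the version pins and Python floor here are assumptions, not taken from the repository):

  # /// script
  # requires-python = ">=3.9"
  # dependencies = ["torch"]
  # ///
  import argparse
  import torch  # uv reads the header above and provisions torch before running the script

With such a header, uv run pytorch/baseline.py --N 100000000 --iters 10 --device cuda creates an ephemeral environment containing torch and then executes the script; alternatively, a pyproject.toml at the repository root that lists torch as a dependency would let the same commands run inside the project environment.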