Announcing `numr`: A "Batteries-Included" Numerical Library for Rust (NumPy + GPU + Autograd)
⚓ Rust 📅 2026-02-04 👤 surdeus
Hi everyone,
I’ve started working on a new project called numr, and I wanted to share the vision and get early feedback.
The core idea is simple: What if NumPy was built today, in Rust, with the features we always wished it had built-in?
We all love ndarray and the existing Rust ecosystem, but fragmentation is a real pain point. You often need separate crates for BLAS, LAPACK, sparse arrays, and especially GPU support. And if you need gradients, you usually have to switch to a full deep-learning framework like burn or candle.
numr aims to be the foundational numerical layer that unifies these. It is designed to be backend-agnostic, differentiable, and extensible.
What Makes numr Different?
1. "Same Code, Any Backend" Architecture
numr is built around a generic Tensor<R: Runtime> abstraction. You write your logic once, and it runs on:
- CPU: (AVX2/AVX-512/NEON accelerated)
- CUDA: (Native PTX kernels for NVIDIA)
- WebGPU: (Cross-platform support for AMD, Intel, and Apple Silicon)
Unlike wrappers around cuBLAS or MKL, numr implements native kernels for operations, meaning no massive external C++ dependencies and full transparency down to the metal.
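To illustrate the pattern (this is a standalone sketch, not numr's actual trait definitions), "same code, any backend" means writing operations against a `Runtime` trait so the same generic function compiles for every backend that implements it:

```rust
// Sketch of a backend-agnostic runtime trait. Names here (Runtime, Cpu,
// axpy) are illustrative, not numr's API.
trait Runtime {
    fn name(&self) -> &'static str;
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

struct Cpu;
impl Runtime for Cpu {
    fn name(&self) -> &'static str { "cpu" }
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        // Element-wise add; a real backend would dispatch a SIMD/GPU kernel.
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
}

// Generic over the backend: the caller picks Cpu, Cuda, Wgpu, ...
fn elementwise_sum<R: Runtime>(rt: &R, a: &[f32], b: &[f32]) -> Vec<f32> {
    rt.add(a, b)
}

fn main() {
    let rt = Cpu;
    let c = elementwise_sum(&rt, &[1.0, 2.0], &[3.0, 4.0]);
    println!("{} -> {:?}", rt.name(), c); // cpu -> [4.0, 6.0]
}
```

Adding a new backend then means implementing the trait once, after which every generic algorithm written on top of it works unchanged.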
2. Built-in Autograd (Reverse & Forward Mode)
Differentiation isn't an afterthought. It supports:
- Reverse-mode: For standard gradient descent/training.
- Forward-mode: For efficient Jacobian-Vector Products (JVP), crucial for scientific computing tasks like stiff ODE solvers.
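For readers unfamiliar with forward mode: it propagates a derivative alongside each value in a single pass, which is exactly what a JVP needs. A minimal standalone illustration using dual numbers (a textbook construction, not numr's implementation):

```rust
// Forward-mode AD via dual numbers: each value carries (v, dv/dx).
#[derive(Clone, Copy, Debug)]
struct Dual {
    v: f64, // value
    d: f64, // derivative w.r.t. the seeded input
}

impl Dual {
    // Seed an input variable with derivative 1.
    fn var(v: f64) -> Self {
        Dual { v, d: 1.0 }
    }
    // Product rule: (uv)' = u'v + uv'
    fn mul(self, o: Dual) -> Dual {
        Dual { v: self.v * o.v, d: self.d * o.v + self.v * o.d }
    }
    // Chain rule through sin: (sin u)' = u' cos u
    fn sin(self) -> Dual {
        Dual { v: self.v.sin(), d: self.d * self.v.cos() }
    }
}

fn main() {
    // f(x) = x * sin(x), so f'(x) = sin(x) + x cos(x)
    let x = Dual::var(2.0);
    let y = x.mul(x.sin());
    println!("f(2) = {:.4}, f'(2) = {:.4}", y.v, y.d);
}
```

Reverse mode instead records the computation and sweeps backward, which is cheaper when one scalar output depends on many inputs (the training case); forward mode wins when you need directional derivatives of many outputs, as in stiff ODE Jacobians.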
3. Modern & Comprehensive Dtypes
Beyond standard f32/f64, numr has native support for:
- f16 / bf16 (Half precision)
- fp8 (FP8E4M3, FP8E5M2 for modern ML workloads)
- Complex numbers (Complex64/128)
- Sparse tensors (CSR, CSC, COO formats), integrated directly rather than shipped as a separate crate
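As a quick refresher on what the CSR format stores (a generic illustration, not numr's internal layout): row pointers, column indices, and non-zero values.

```rust
// Convert a dense row-major matrix to CSR: (row_ptr, col_idx, values).
// row_ptr[i]..row_ptr[i+1] indexes the non-zeros of row i.
fn dense_to_csr(m: &[Vec<f64>]) -> (Vec<usize>, Vec<usize>, Vec<f64>) {
    let (mut ptr, mut idx, mut val) = (vec![0], Vec::new(), Vec::new());
    for row in m {
        for (j, &x) in row.iter().enumerate() {
            if x != 0.0 {
                idx.push(j);
                val.push(x);
            }
        }
        ptr.push(idx.len()); // cumulative non-zero count ends the row
    }
    (ptr, idx, val)
}

fn main() {
    let m = vec![vec![1.0, 0.0, 2.0], vec![0.0, 3.0, 0.0]];
    let (ptr, idx, val) = dense_to_csr(&m);
    println!("{:?} {:?} {:?}", ptr, idx, val);
    // [0, 2, 3] [0, 2, 1] [1.0, 2.0, 3.0]
}
```

CSR makes row slicing and sparse matrix-vector products cheap; COO (plain coordinate triples) is easier to construct incrementally, and CSC is the column-major mirror of CSR.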
The "SciPy" Layer: solvr
To prove the robustness of numr, I am simultaneously building solvr, a library for higher-level scientific computing (equivalent to SciPy). It currently implements algorithms for:
- Optimization: BFGS (built on tensor ops, so fully GPU-accelerated) and plain gradient descent.
- Integration: Trapezoidal, Simpson's rule, and ODE solvers (RK45, Dop853).
- Signal Processing: FFT, Convolution, STFT.
Because solvr is built on numr traits, all of these algorithms run seamlessly on CUDA or WebGPU without changing a single line of code.
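To make the quadrature part concrete, here is a generic composite Simpson's rule in plain Rust (a textbook sketch of the scheme solvr names, not solvr's actual code, which operates on numr tensors):

```rust
// Composite Simpson's rule over [a, b] with n panels (n must be even,
// so an odd n is bumped up by one).
fn simpson(f: impl Fn(f64) -> f64, a: f64, b: f64, n: usize) -> f64 {
    let n = if n % 2 == 0 { n } else { n + 1 };
    let h = (b - a) / n as f64;
    let mut s = f(a) + f(b);
    for i in 1..n {
        // Interior points alternate weights 4, 2, 4, 2, ...
        let w = if i % 2 == 1 { 4.0 } else { 2.0 };
        s += w * f(a + i as f64 * h);
    }
    s * h / 3.0
}

fn main() {
    // ∫₀^π sin(x) dx = 2; Simpson converges at O(h⁴)
    let approx = simpson(|x| x.sin(), 0.0, std::f64::consts::PI, 100);
    println!("{:.6}", approx); // ≈ 2.000000
}
```

The trapezoidal rule is the same loop with uniform interior weights and O(h²) convergence; RK45 and Dop853 apply the analogous embedded-pair idea to ODE stepping with adaptive step-size control.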
Current Status
This is currently experimental (beta-quality) software.
- The architecture is stable.
- Many kernels (Matmul, Unary, Binary, Reductions) are implemented for all backends.
- However, performance tuning (vs. vendor libs) is ongoing, and the API is subject to change.
Check it out
I’m looking for feedback on the API design and contributors who are interested in writing native kernels (WGSL/CUDA/Rust) or high-level scientific algorithms.
Repository:
- GitHub - farhan-syah/numr: A high-performance numerical computing library for Rust with GPU acceleration, inspired by Numpy
- GitHub - farhan-syah/solvr: Rust scientific and advanced computing library
Example usage:

use numr::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Select a device (CPU, Cuda, or Wgpu)
    let device = CudaRuntime::default_device()?;

    // Create tensors directly on the GPU
    let a = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;
    let b = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;

    // matmul dispatches to a native GPU kernel
    let _c = a.matmul(&b)?;
    Ok(())
}
Thanks for reading!