Announcing `numr`: A "Batteries-Included" Numerical Library for Rust (NumPy + GPU + Autograd)

⚓ Rust    📅 2026-02-04    👤 surdeus    👁️ 7      


Hi everyone,

I’ve started working on a new project called numr, and I wanted to share the vision and get early feedback.

The core idea is simple: What if NumPy were built today, in Rust, with the features we always wished it had built in?

We all love ndarray and the existing Rust ecosystem, but fragmentation is a real pain point. You often need separate crates for BLAS, LAPACK, sparse arrays, and especially GPU support. If you need gradients, you usually have to switch to a full-blown DL framework like burn or candle.

numr aims to be the foundational numerical layer that unifies these. It is designed to be backend-agnostic, differentiable, and extensible.

🚀 What Makes numr Different?

1. "Same Code, Any Backend" Architecture
numr is built around a generic Tensor<R: Runtime> abstraction. You write your logic once, and it runs on:

  • CPU: AVX2/AVX-512/NEON-accelerated
  • CUDA: native PTX kernels for NVIDIA GPUs
  • WebGPU: cross-platform support for AMD, Intel, and Apple Silicon

Unlike wrappers around cuBLAS or MKL, numr implements native kernels for operations, meaning no massive external C++ dependencies and full transparency down to the metal.
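To make the "same code, any backend" idea concrete, here is a minimal sketch of what a backend-generic design can look like. The trait, struct, and method names below are assumptions for illustration, not numr's actual API:

```rust
// Sketch of a backend-generic tensor: algorithms are written once against a
// Runtime trait, and each backend (CPU, CUDA, WebGPU) provides its own kernels.
trait Runtime {
    fn name(&self) -> &'static str;
}

#[derive(Clone, Copy)]
struct CpuRuntime;

impl Runtime for CpuRuntime {
    fn name(&self) -> &'static str { "cpu" }
}

struct Tensor<R: Runtime> {
    data: Vec<f32>,
    shape: Vec<usize>,
    runtime: R,
}

impl<R: Runtime + Copy> Tensor<R> {
    // Elementwise scale; a real backend would dispatch to SIMD/GPU kernels here.
    fn scale(&self, factor: f32) -> Tensor<R> {
        Tensor {
            data: self.data.iter().map(|x| x * factor).collect(),
            shape: self.shape.clone(),
            runtime: self.runtime,
        }
    }
}

// Generic algorithm: written once, runs on any Runtime implementation.
fn double<R: Runtime + Copy>(t: &Tensor<R>) -> Tensor<R> {
    t.scale(2.0)
}

fn main() {
    let t = Tensor { data: vec![1.0, 2.0, 3.0], shape: vec![3], runtime: CpuRuntime };
    let d = double(&t);
    println!("{} -> {:?}", t.runtime.name(), d.data); // cpu -> [2.0, 4.0, 6.0]
}
```

The point of the pattern is that `double` never names a backend; swapping `CpuRuntime` for a GPU runtime requires no changes to the algorithm itself.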

2. Built-in Autograd (Reverse & Forward Mode)
Differentiation isn't an afterthought. It supports:

  • Reverse-mode: For standard gradient descent/training.
  • Forward-mode: For efficient Jacobian-Vector Products (JVP), crucial for scientific computing tasks like stiff ODE solvers.
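For readers unfamiliar with forward mode, a tiny dual-number implementation shows the idea behind a JVP: each value carries a tangent that is propagated alongside it. This is a self-contained illustration of the technique, not numr's implementation:

```rust
// Forward-mode AD via dual numbers: (val, tan) where tan is the directional
// derivative. Seeding tan = 1 on the input x yields df/dx after one pass.
#[derive(Clone, Copy, Debug)]
struct Dual {
    val: f64, // primal value
    tan: f64, // tangent (directional derivative)
}

impl Dual {
    fn new(val: f64, tan: f64) -> Self { Dual { val, tan } }
    fn add(self, o: Dual) -> Dual {
        Dual { val: self.val + o.val, tan: self.tan + o.tan }
    }
    fn mul(self, o: Dual) -> Dual {
        // Product rule: (uv)' = u'v + uv'
        Dual { val: self.val * o.val, tan: self.tan * o.val + self.val * o.tan }
    }
    fn sin(self) -> Dual {
        // Chain rule: (sin u)' = u' * cos(u)
        Dual { val: self.val.sin(), tan: self.tan * self.val.cos() }
    }
}

// f(x) = x^2 + sin(x), so f'(x) = 2x + cos(x)
fn f(x: Dual) -> Dual {
    x.mul(x).add(x.sin())
}

fn main() {
    let y = f(Dual::new(1.0, 1.0)); // tangent seed 1.0
    println!("f(1) = {}, f'(1) = {}", y.val, y.tan); // f'(1) = 2 + cos(1) ≈ 2.5403
}
```

A full JVP generalizes this to vectors: seed the input with a direction `v` and the single forward pass returns `J·v`, which is why forward mode suits stiff ODE solvers and other Jacobian-heavy workloads.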

3. Modern & Comprehensive Dtypes
Beyond standard f32/f64, numr has native support for:

  • f16 / bf16 (Half precision)
  • fp8 (FP8E4M3, FP8E5M2 for modern ML workloads)
  • Complex numbers (Complex64/128)
  • Sparse tensors (CSR, CSC, COO formats), integrated directly rather than split into a separate crate
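As a quick reminder of what the sparse formats store, here is a minimal COO (coordinate) matrix with a matrix-vector product, written in plain Rust purely to illustrate the format (the type and method names are not numr's API):

```rust
// COO format: parallel arrays of (row, col, value) triples for the nonzeros.
struct CooMatrix {
    rows: Vec<usize>,
    cols: Vec<usize>,
    vals: Vec<f64>,
    shape: (usize, usize),
}

impl CooMatrix {
    // y = A * x, touching only the stored nonzeros.
    fn matvec(&self, x: &[f64]) -> Vec<f64> {
        let mut y = vec![0.0; self.shape.0];
        for ((&r, &c), &v) in self.rows.iter().zip(&self.cols).zip(&self.vals) {
            y[r] += v * x[c];
        }
        y
    }
}

fn main() {
    // [[2, 0],
    //  [1, 3]]  stored as three nonzero triples
    let m = CooMatrix {
        rows: vec![0, 1, 1],
        cols: vec![0, 0, 1],
        vals: vec![2.0, 1.0, 3.0],
        shape: (2, 2),
    };
    println!("{:?}", m.matvec(&[1.0, 1.0])); // [2.0, 4.0]
}
```

CSR and CSC compress the row (respectively column) coordinates into offset arrays, trading the simplicity of COO for faster row- or column-oriented traversal.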

🛠️ The "SciPy" Layer: solvr

To prove the robustness of numr, I am simultaneously building solvr, a library for higher-level scientific computing (equivalent to SciPy). It currently implements algorithms for:

  • Optimization: BFGS (built on tensor ops, fully GPU-accelerated) and simple gradient descent.
  • Integration: the trapezoidal and Simpson's rules, plus ODE solvers (RK45, Dop853).
  • Signal Processing: FFT, convolution, STFT.

Because solvr is built on numr traits, all of these algorithms run seamlessly on CUDA or WebGPU without changing a single line of code.
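To give a feel for the simplest item on that list, here is the composite trapezoidal rule in plain Rust. This is a textbook sketch of the algorithm, not solvr's actual signature:

```rust
// Composite trapezoidal rule: approximate the integral of f over [a, b]
// using n equal subintervals. Interior points are weighted 1, endpoints 1/2.
fn trapezoid<F: Fn(f64) -> f64>(f: F, a: f64, b: f64, n: usize) -> f64 {
    let h = (b - a) / n as f64;
    let mut sum = 0.5 * (f(a) + f(b));
    for i in 1..n {
        sum += f(a + i as f64 * h);
    }
    sum * h
}

fn main() {
    // ∫₀¹ x² dx = 1/3; with n = 1000 the error is O(h²) ≈ 1.7e-7
    let approx = trapezoid(|x| x * x, 0.0, 1.0, 1000);
    println!("{approx:.6}");
}
```

In a trait-based design like solvr's, the scalar `f64` work above would instead be expressed as tensor operations, which is what lets the same routine execute on GPU backends.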

⚠️ Current Status

This is currently experimental (beta) software.

  • The architecture is stable.
  • Many kernels (Matmul, Unary, Binary, Reductions) are implemented for all backends.
  • However, performance tuning (vs. vendor libs) is ongoing, and the API is subject to change.

🔗 Check it out

I’m looking for feedback on the API design and contributors who are interested in writing native kernels (WGSL/CUDA/Rust) or high-level scientific algorithms.

Repository:

Example usage:

```rust
use numr::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define a device (CPU, Cuda, or Wgpu)
    let device = CudaRuntime::default_device()?;

    // Create tensors directly on the GPU
    let a = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;
    let b = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;

    // Operations use native GPU kernels
    let _c = a.matmul(&b)?;
    Ok(())
}
```

Thanks for reading!
