Fearless Concurrency on the GPU: Safe GPU kernels in Rust
Rust | 2026-06-17 | surdeus
Hi all. I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU."
The idea is to bring safety across the GPU launch boundary and into the kernel, and to safely enable high-performance, non-trivial GPU programs. Host code statically composes "device operations" (DeviceOp). Like Rust iterators, a DeviceOp composes lazily and statically: combinators build the pipeline at compile time, and nothing runs until you execute it (sync, async, or as a replayable CUDA graph).
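For readers who want the iterator analogy made concrete, here is a minimal sketch in plain std Rust. It is not the cuTile API: `lazy_pipeline_demo` and the counter are hypothetical names used only for illustration, and a real `DeviceOp` pipeline would be executed on the device rather than collected into a `Vec`. It only shows the property the post describes, that combinators build a pipeline and nothing runs until you execute it.

```rust
use std::cell::Cell;

/// Builds an iterator pipeline, then executes it.
/// Returns (closure calls before execution, output, closure calls after).
fn lazy_pipeline_demo() -> (u32, Vec<i32>, u32) {
    // Counts how many times the `map` closure actually runs.
    let calls = Cell::new(0u32);

    // Composing combinators only builds a value describing the pipeline.
    // No element has been processed at this point.
    let pipeline = (1..=4)
        .map(|x| {
            calls.set(calls.get() + 1);
            x * 2
        })
        .filter(|x| x % 4 == 0);

    let before = calls.get();

    // Execution happens here, the analogue of running a composed operation.
    let out: Vec<i32> = pipeline.collect();

    (before, out, calls.get())
}
```

Calling `lazy_pipeline_demo()` shows a call count of zero before `collect` and four after it, which is the "build first, run later" behaviour the post attributes to `DeviceOp`.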
You partition a mutable output tensor into disjoint pieces on the host, and each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references. Kernels are written with single-threaded semantics (logical tile threads), which the compiler maps to thread blocks with managed shared memory. The pieces are provably disjoint and ordering is threaded through mutable references. Safety stays extensible the usual Rust way: wrap unsafe in safe abstractions where you supply the invariants.
#[cutile::entry()]
fn add<const B: i32>(
    z: &mut Tensor<f32, {[B]}>,  // exclusive write
    x: &Tensor<f32, {[-1]}>,     // shared read
    y: &Tensor<f32, {[-1]}>,     // shared read
) {
    let tx = load_tile_like(x, z);
    let ty = load_tile_like(y, z);
    z.store(tx + ty);
}
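The ownership pattern in the kernel above (one exclusive `&mut` output tile, shared read-only inputs) has a close CPU-side analogue in std Rust, sketched below. This is an illustration under assumptions, not how cuTile is implemented: `tiled_add` is a hypothetical helper, OS threads stand in for tile programs, and `chunks_mut` stands in for the host-side partition of the output tensor.

```rust
use std::thread;

/// Element-wise add, z = x + y, computed one tile per scoped thread.
/// Each thread receives an exclusive `&mut` chunk of `z` and shared
/// `&` views of the matching ranges of `x` and `y`.
fn tiled_add(x: &[f32], y: &[f32], z: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(x.len(), z.len());
    assert_eq!(y.len(), z.len());

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping mutable slices, so the
        // borrow checker can prove that no two threads write the same memory.
        for (i, z_tile) in z.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let end = start + z_tile.len();
            let x_tile = &x[start..end];
            let y_tile = &y[start..end];

            s.spawn(move || {
                for ((zi, xi), yi) in z_tile.iter_mut().zip(x_tile).zip(y_tile) {
                    *zi = xi + yi;
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

For example, `tiled_add(&x, &y, &mut z, 2)` on five-element slices splits `z` into tiles of length 2, 2 and 1. Trying to hand the same chunk of `z` to two threads is rejected at compile time, which is the guarantee the post describes carrying across the launch boundary.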
On a B200, a safe GEMM implementation is competitive with cuBLAS, so the safety is effectively free. We're also seeing performance competitive with the state of the art on memory-bound, batch-1 inference.
- Code: NVlabs/cutile-rs on GitHub. cuTile Rust provides a safe, tile-based kernel programming DSL for the Rust programming language. It features a safe host-side API for passing tensors to asynchronously executed kernel functions.
- Paper: [2606.15991] Fearless Concurrency on the GPU
I'd genuinely value this crowd's take on the safe-API design.