Fearless Concurrency on the GPU: Safe GPU kernels in Rust

Rust    2026-06-17    surdeus

Hi all. I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU."

The idea is to bring safety across the GPU launch boundary and into the kernel, and to safely enable high-performance, non-trivial GPU programs. Host code statically composes "device operations" (DeviceOp). Like Rust iterators, a DeviceOp composes lazily and statically: combinators build the pipeline at compile time, and nothing runs until you execute it (sync, async, or as a replayable CUDA graph).
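To make the iterator analogy concrete, here is a minimal sketch using only standard-library iterators. It is not the cuTile Rust DeviceOp API; the function name `lazy_pipeline_demo` and the call counter are made up for illustration. It shows that composing adapters runs nothing, and that work happens only when the pipeline is consumed.

```rust
use std::cell::Cell;

/// Builds a lazy pipeline and reports how many times the `map` closure ran
/// before and after the pipeline was consumed. (Hypothetical helper, std only.)
fn lazy_pipeline_demo() -> (usize, usize, Vec<i32>) {
    let calls = Cell::new(0usize);
    let input = [1, 2, 3, 4];

    // Composing adapters only builds a nested iterator type at compile time;
    // no element has been touched yet.
    let pipeline = input
        .iter()
        .map(|x| {
            calls.set(calls.get() + 1);
            x * 2
        })
        .filter(|x| x % 4 == 0);

    let calls_before = calls.get();

    // Consuming the iterator is the "execute" step.
    let output: Vec<i32> = pipeline.collect();
    let calls_after = calls.get();

    (calls_before, calls_after, output)
}

fn main() {
    let (before, after, output) = lazy_pipeline_demo();
    println!("closure calls before execution: {before}");
    println!("closure calls after execution: {after}");
    println!("output: {output:?}");
}
```

The pipeline's type encodes the whole composition, so the compiler sees every stage statically. As I read the post, DeviceOp combinators aim for the same property on the device side.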

You partition a mutable output tensor into disjoint pieces on the host, and each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references. Kernels are written with single-threaded semantics (logical tile threads), which the compiler maps to thread blocks with managed shared memory. The pieces are provably disjoint and ordering is threaded through mutable references. Safety stays extensible the usual Rust way: wrap unsafe in safe abstractions where you supply the invariants.

#[cutile::entry()]
fn add<const B: i32>(
  z: &mut Tensor<f32, {[B]}>,   // exclusive write
  x: &Tensor<f32, {[-1]}>,      // shared read
  y: &Tensor<f32, {[-1]}>,      // shared read
) {
  let tx = load_tile_like(x, z);
  let ty = load_tile_like(y, z);
  z.store(tx + ty);
}
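
For readers who have not seen this ownership pattern, the host-side partitioning has a rough CPU analogue in the standard library. The sketch below is an assumption-laden stand-in, not cuTile Rust code: `add_tiled` is a hypothetical name, `chunks_mut` plays the role of the partition, and scoped threads play the role of tile programs. Each worker gets an exclusive `&mut` tile of the output and shared references to the inputs.

```rust
use std::thread;

/// CPU stand-in for a tiled elementwise add (hypothetical, std only):
/// `z` is split into disjoint tiles, and each worker gets an exclusive
/// `&mut` tile of the output plus shared slices of the inputs.
fn add_tiled(z: &mut [f32], x: &[f32], y: &[f32], tile: usize) {
    assert!(tile > 0);
    assert_eq!(z.len(), x.len());
    assert_eq!(z.len(), y.len());

    thread::scope(|s| {
        // `chunks_mut` hands out non-overlapping `&mut [f32]` slices, so the
        // borrow checker accepts writing to them from different threads.
        for (i, z_tile) in z.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let x_tile = &x[start..start + z_tile.len()];
            let y_tile = &y[start..start + z_tile.len()];
            s.spawn(move || {
                for ((zi, xi), yi) in z_tile.iter_mut().zip(x_tile).zip(y_tile) {
                    *zi = xi + yi;
                }
            });
        }
    });
}

fn main() {
    let x = vec![1.0_f32; 8];
    let y: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let mut z = vec![0.0_f32; 8];
    add_tiled(&mut z, &x, &y, 4);
    println!("{z:?}");
}
```

Trying to move the same `z_tile` into two workers is rejected at compile time, which is the CPU-side version of the guarantee described above.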

On a B200, a safe GEMM implementation is competitive with cuBLAS, so the safety is effectively free. We're also seeing performance competitive with the state of the art on memory-bound, batch-1 inference.

I'd genuinely value this crowd's take on the safe-API design.
