Open source LLM compiler in Rust for models on Huggingface

⚓ Rust    📅 2026-03-13    👤 surdeus

Open source LLM compiler for models on Huggingface. UNC: 152 tok/s, 11.3 W, 5.3B CPU instructions. mlx-lm: 113 tok/s, 14.1 W, 31.4B CPU instructions. Measured on a MacBook M1 Pro.

UNC is 1.35x faster while using 25% less GPU power, for roughly 1.7x better energy efficiency. The compiled approach eliminates the Python runtime and framework dispatch overhead entirely: 8.4x fewer CPU instructions means less heat, less power draw, and more thermal headroom for the GPU.
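The ~1.7x efficiency claim follows directly from the throughput and power figures above, since tokens per joule is just tok/s divided by watts. A minimal sketch of that arithmetic (the function name is illustrative):

```rust
// Energy efficiency = throughput (tok/s) / power (W) = tokens per joule.
fn tokens_per_joule(tok_per_s: f64, watts: f64) -> f64 {
    tok_per_s / watts
}

fn main() {
    let unc = tokens_per_joule(152.0, 11.3); // ~13.45 tok/J
    let mlx = tokens_per_joule(113.0, 14.1); // ~8.01 tok/J
    let ratio = unc / mlx;                   // ~1.68, i.e. the ~1.7x claim
    println!("UNC: {unc:.2} tok/J, mlx-lm: {mlx:.2} tok/J, ratio: {ratio:.2}x");
}
```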

UNC uses an LLVM-inspired IR frontend to abstract over weights and architecture before applying optimisations:

HuggingFace model
       |
  [ Frontend ]        Parse config.json + safetensors
       |
  [ IR Graph ]        Hardware-agnostic tensor graph
       |
  [ Compiler ]        Fusion, quantization, memory planning
       |
       +------------------+------------------+------------------+
       |                  |                  |                  |
  [ Metal ]          [ CUDA ]          [ ROCm ]          [ WASM ]
  Obj-C + Metal      PTX kernels       HIP kernels       WebGPU shaders
  shaders            (planned)         (planned)         (planned)
       |
  Native binary
  Mach-O (AOT) or
  .unc bundle (JIT)
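The frontend stage above consumes the safetensors container format, which begins with an 8-byte little-endian header length followed by a JSON header describing each tensor, then the raw tensor bytes. A minimal sketch of splitting such a buffer (std only; the toy buffer and function name are illustrative, not UNC's actual code):

```rust
use std::convert::TryInto;

/// Split a safetensors byte buffer into its JSON header and raw tensor data.
/// Layout: [u64 little-endian header length][JSON header][tensor bytes].
fn split_safetensors(buf: &[u8]) -> Option<(&str, &[u8])> {
    let len = u64::from_le_bytes(buf.get(..8)?.try_into().ok()?) as usize;
    let header = std::str::from_utf8(buf.get(8..8 + len)?).ok()?;
    Some((header, &buf[8 + len..]))
}

fn main() {
    // Toy file: a single f32 tensor of shape [1] stored after the header.
    let header = br#"{"w":{"dtype":"F32","shape":[1],"data_offsets":[0,4]}}"#;
    let mut file = (header.len() as u64).to_le_bytes().to_vec();
    file.extend_from_slice(header);
    file.extend_from_slice(&1.0f32.to_le_bytes());

    let (json, data) = split_safetensors(&file).unwrap();
    assert!(json.contains(r#""dtype":"F32""#));
    assert_eq!(data.len(), 4); // one f32
    println!("header: {json}");
}
```

The real frontend would additionally parse the JSON (e.g. with serde) and cross-reference it against config.json to recover the architecture.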

IR: Hardware-agnostic typed tensor graph with BatchMatMul, QuantizedMatVec, RMSNorm, LayerNorm, QKNorm, RoPE, SDPA, SwiGLU, KVCacheAppend, Gather, etc. The IR is target-independent — the same graph can be lowered to Metal (current), CUDA, ROCm, WASM, or CPU-only backends with acceleration providers like Intel oneDNN.
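To make "hardware-agnostic typed tensor graph" concrete, here is a hypothetical sketch of what such an IR can look like (op names are taken from the list above, but the types and structure are assumptions, not UNC's actual IR):

```rust
/// A hypothetical subset of the IR's typed tensor ops.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Input { shape: Vec<usize> },
    QuantizedMatVec { weight: String },
    RmsNorm { eps: f32 },
    Rope { theta: f32 },
}

/// A node references its inputs by index, so the graph is a flat,
/// target-independent list that any backend (Metal, CUDA, ...) can lower.
struct Node { op: Op, inputs: Vec<usize> }

struct Graph { nodes: Vec<Node> }

impl Graph {
    fn push(&mut self, op: Op, inputs: Vec<usize>) -> usize {
        self.nodes.push(Node { op, inputs });
        self.nodes.len() - 1
    }
}

fn main() {
    // One attention-prologue slice: normalize, project, apply RoPE.
    let mut g = Graph { nodes: Vec::new() };
    let x = g.push(Op::Input { shape: vec![1, 4096] }, vec![]);
    let n = g.push(Op::RmsNorm { eps: 1e-6 }, vec![x]);
    let q = g.push(Op::QuantizedMatVec { weight: "q_proj".into() }, vec![n]);
    let _ = g.push(Op::Rope { theta: 10_000.0 }, vec![q]);
    println!("{} nodes", g.nodes.len());
}
```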

Compiler passes: Weight binding, dead code elimination, QKV fusion, Gate+Up fusion, SwiGLU fusion, Add+RMSNorm fusion, RoPE+KV fusion, PSQ pipeline, dual-path (GEMM/GEMV), kernel matching, barrier analysis, memory planning with buffer aliasing.
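A fusion pass such as Add+RMSNorm can be expressed as a peephole rewrite over the op sequence: each pair of adjacent ops collapses into one fused kernel, saving a dispatch and a round trip through memory. A hypothetical sketch on a toy op enum (not UNC's actual pass; names are assumptions):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Op { Add, RmsNorm, AddRmsNorm, MatMul }

/// Fuse each Add immediately followed by RmsNorm into a single AddRmsNorm.
fn fuse_add_rmsnorm(ops: &[Op]) -> Vec<Op> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < ops.len() {
        if ops[i] == Op::Add && ops.get(i + 1) == Some(&Op::RmsNorm) {
            out.push(Op::AddRmsNorm);
            i += 2; // consume both ops
        } else {
            out.push(ops[i].clone());
            i += 1;
        }
    }
    out
}

fn main() {
    let ops = vec![Op::MatMul, Op::Add, Op::RmsNorm, Op::MatMul];
    let fused = fuse_add_rmsnorm(&ops);
    assert_eq!(fused, vec![Op::MatMul, Op::AddRmsNorm, Op::MatMul]);
    println!("fused: {fused:?}");
}
```

A real pass would match on dataflow edges in the graph rather than on linear adjacency, but the rewrite pattern is the same.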
