MetaXuda: Metal GPU runtime for ML on Apple Silicon (1.1 TOPS with Tokio async)


surdeus

Hey Rustaceans! 👋

I built MetaXuda - a native GPU runtime for machine learning on Apple Silicon, entirely in Rust.

Motivation:
Got tired of the "buy Windows for ML" advice. Most ML libraries are CUDA-only with zero macOS GPU support, and translation layers like ZLUDA add overhead, so I built a runtime from scratch on Metal.

Tech Stack:

  • Rust core with Tokio async runtime
  • Metal for GPU acceleration
  • PyO3 for Python bindings (cuda_pipeline.so; a binding sketch follows this list)
  • Arrow-based in-kernel quantization
  • Multi-tier memory manager (GPU → RAM → SSD)
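
Quick illustration of the binding layer: a minimal PyO3 sketch of how an op could be exposed and compiled into the cuda_pipeline module. The `vector_add` function and its CPU-only body are invented for the example, not MetaXuda's actual API; the real ops dispatch to Metal.

```rust
use pyo3::prelude::*;

// Hypothetical op, CPU-only for the sketch; real ops dispatch to Metal.
#[pyfunction]
fn vector_add(a: Vec<f32>, b: Vec<f32>) -> Vec<f32> {
    a.iter().zip(&b).map(|(x, y)| x + y).collect()
}

// The module name matches the compiled artifact, cuda_pipeline.so.
#[pymodule]
fn cuda_pipeline(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(vector_add, m)?)
}
```

From Python this imports as `import cuda_pipeline`.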

Performance:

  • 1.1 TOPS throughput (95% of M3 Max theoretical peak)
  • 230+ GPU operations (math, transform, ML primitives)
  • 93.37% GPU utilization cap, so macOS's own GPU work is never starved (one capping pattern is sketched after this list)
  • Zero race conditions via centralized scheduler
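
On the utilization cap: I can't paste the real implementation here, but the idea is "bound the amount of in-flight GPU work so macOS always keeps a slice." A minimal sketch with a Tokio semaphore; the slot counts and names are made up for the example:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Illustrative budget: 15 of 16 slots (~94%) left for our work,
    // one slot always free so the OS compositor is never starved.
    let gpu_slots = Arc::new(Semaphore::new(15));
    let mut tasks = Vec::new();
    for op in 0..64 {
        let slots = gpu_slots.clone();
        tasks.push(tokio::spawn(async move {
            let _permit = slots.acquire().await.unwrap(); // held until drop
            // encode + commit the Metal command buffer here ...
            println!("op {op} running");
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```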

Architecture Highlights:

  • Migrated from sync → async (40+ iterations to get it right!)
  • Stream managers + thread-pool groups coordinated by a central scheduler (pattern sketched after this list)
  • Handles 100GB+ workloads through intelligent memory tiering
  • CUDA-compatible API naming for library interop
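
The race-freedom claim comes from the scheduler being the sole owner of all mutable scheduling state; stream managers only talk to it over channels. A stripped-down sketch of that pattern (message types and names are illustrative, not MetaXuda's actual API):

```rust
use tokio::sync::{mpsc, oneshot};

// Illustrative message type: workers submit ops and await completion.
enum SchedMsg {
    Submit { op: String, done: oneshot::Sender<()> },
}

// The scheduler task solely owns its state: no locks, no data races.
async fn scheduler(mut rx: mpsc::Receiver<SchedMsg>) {
    let mut dispatched: u64 = 0; // never shared, never locked
    while let Some(SchedMsg::Submit { op, done }) = rx.recv().await {
        dispatched += 1;
        // encode the op onto a Metal stream here ...
        println!("op #{dispatched}: {op}");
        let _ = done.send(());
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(64);
    let handle = tokio::spawn(scheduler(rx));

    let (done_tx, done_rx) = oneshot::channel();
    tx.send(SchedMsg::Submit { op: "gemm".into(), done: done_tx })
        .await
        .unwrap();
    done_rx.await.unwrap(); // completion signaled by the scheduler

    drop(tx); // closing the channel lets the scheduler task exit
    handle.await.unwrap();
}
```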

Current Status:

  • Works with Numba (bypassing its execution path)
  • Install with `pip install metaxuda`
  • Toolkit integration (scikit-learn, XGBoost) coming next
  • CUDA API coverage still in progress

Known Challenges:

  • Apple's Metal stream limits are undocumented (I reverse-engineered what I could)
  • Some intentional blocking favors stability over raw speed
  • Roughly 1-in-a-million scheduler notification misses (rare edge case; the failure shape is sketched below)
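
I won't claim this is exactly my bug, but the classic shape of it with tokio::sync::Notify is: notify_waiters() stores no permit, so a notification fired between checking the work queue and registering notified() is lost, and back-to-back notify_one() calls coalesce into a single stored permit. The usual mitigation is to bound the wait and re-check the real condition:

```rust
use std::sync::Arc;
use tokio::sync::Notify;
use tokio::time::{timeout, Duration};

// Sketch of a loss-tolerant wait: a missed notification costs a few
// milliseconds of latency instead of hanging the scheduler forever.
async fn wait_for_work(notify: Arc<Notify>, has_work: impl Fn() -> bool) {
    loop {
        if has_work() {
            return; // re-check the condition, never trust the wakeup alone
        }
        // Bounded wait: wakes on notify_one() or the timeout, whichever is first.
        let _ = timeout(Duration::from_millis(5), notify.notified()).await;
    }
}
```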

Looking for feedback on:

  • Async scheduler design patterns (Tokio + Metal coordination)
  • Memory tier eviction strategies (a strawman LRU sketch follows this list)
  • Anyone hitting Apple GPU quirks I should know about?
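
For concreteness on the eviction question, here's the strawman I'd compare alternatives against: LRU demotion one tier down when the upper tier fills. Everything here (types, budgets, fields) is invented for the sketch, not MetaXuda's internals:

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, PartialEq)]
enum Tier { Gpu, Ram, Ssd } // Ram -> Ssd demotion works the same way

struct Buffer { bytes: usize, tier: Tier }

struct TierManager {
    lru: VecDeque<Buffer>, // front = hottest, back = eviction candidate
    gpu_used: usize,
    gpu_budget: usize,
}

impl TierManager {
    // Demote least-recently-used GPU buffers until `needed` bytes fit.
    fn ensure_gpu_space(&mut self, needed: usize) {
        while self.gpu_used + needed > self.gpu_budget {
            match self.lru.iter_mut().rev().find(|b| b.tier == Tier::Gpu) {
                Some(victim) => {
                    victim.tier = Tier::Ram; // real code would copy the data out
                    self.gpu_used -= victim.bytes;
                }
                None => break, // nothing left on the GPU to demote
            }
        }
    }
}
```

Curious whether anyone has found cost-aware policies (transfer size vs. reuse likelihood) worth the extra complexity over plain LRU.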

License inquiries: p.perinban@gmail.com

Would love thoughts from the community, especially on the Rust/async architecture choices!
