How to load and run local LLMs (e.g., Llama) in Rust?

⚓ Rust    📅 2025-10-22    👤 surdeus    👁️ 4

surdeus

Hi everyone,

I'm exploring ways to run local large language models (like Meta's Llama, Llama 2, or Llama 3) directly from a Rust application. I understand that Rust doesn't have a built-in LLM module, but I've seen crates like llm, llama-rs, and mistralrs in the ecosystem.

My goal is to:

  • Load a quantized GGUF model file (e.g., llama-3-8b.Q4_K_M.gguf) from disk,
  • Perform text completion or chat-style inference,
  • Ideally support CPU-only inference, with optional GPU acceleration via Metal or CUDA (roughly the flow I sketch just after this list).
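
To make that concrete, here is roughly the shape of program I'm hoping to end up with. This is an untested sketch pieced together from memory of the mistralrs examples, so treat the specific types (GgufModelBuilder, TextMessages, send_chat_request), the directory layout, and the Cargo dependencies as my assumptions rather than verified API:

```rust
// Assumed Cargo.toml deps: mistralrs, tokio (with macros + rt), anyhow.
use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Point the builder at a local directory plus the quantized GGUF file inside it.
    // (GPU acceleration, if any, would presumably come from cargo features such as metal/cuda.)
    let model = GgufModelBuilder::new(
        "models/llama-3-8b",            // hypothetical local directory
        vec!["llama-3-8b.Q4_K_M.gguf"], // the quantized weights mentioned above
    )
    .with_logging()
    .build()
    .await?;

    // Chat-style request with a single user message.
    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Explain Rust's ownership model in two sentences.",
    );

    let response = model.send_chat_request(messages).await?;
    println!(
        "{}",
        response.choices[0].message.content.as_deref().unwrap_or("")
    );
    Ok(())
}
```

Plain text completion instead of chat would be fine too; the main thing I'm after is that the "load GGUF from disk, then generate" path stays this short.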

However, I'm a bit overwhelmed by the options and their documentation. Could someone share:

  1. Which crate is currently the most actively maintained and beginner-friendly for this use case?
  2. A minimal working example of loading a GGUF model and generating text?
  3. Any gotchas or performance tips (e.g., model format requirements, threading, memory usage)?

I've tried snippets from mistralrs and llm, but ran into issues with model compatibility or unclear API usage. Any guidance or pointers to up-to-date tutorials would be greatly appreciated!
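
For context, my llm attempt looked roughly like the sketch below, adapted from memory of the rustformers/llm README (so the exact parameter structs and callback signature may be off, and the file path is made up). As far as I can tell that crate expects the older GGML .bin format rather than GGUF, which may be where my compatibility errors came from:

```rust
// Assumed Cargo.toml deps: llm, rand.
use std::io::Write;

use llm::Model; // brings the start_session/infer trait methods into scope

fn main() {
    // Note: llm appears to load pre-GGUF GGML files (*.bin), not *.gguf.
    let model = llm::load::<llm::models::Llama>(
        std::path::Path::new("models/llama-7b.ggmlv3.q4_0.bin"), // hypothetical GGML file
        llm::TokenizerSource::Embedded,
        Default::default(), // llm::ModelParameters (context size, GPU layers, etc.)
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("failed to load model: {err}"));

    // One session per conversation; it owns the context/KV state.
    let mut session = model.start_session(Default::default());

    let res = session.infer::<std::convert::Infallible>(
        &model,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: "Rust is a cool programming language because".into(),
            parameters: &llm::InferenceParameters::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(128),
        },
        &mut Default::default(), // llm::OutputRequest
        // Streaming callback: print tokens as they arrive.
        |r| match r {
            llm::InferenceResponse::PromptToken(t)
            | llm::InferenceResponse::InferredToken(t) => {
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(llm::InferenceFeedback::Continue)
            }
            _ => Ok(llm::InferenceFeedback::Continue),
        },
    );

    match res {
        Ok(stats) => println!("\n\n{stats}"),
        Err(err) => println!("\n{err}"),
    }
}
```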

Thanks in advance 🙏


๐Ÿท๏ธ Rust_feed