Building a Production Multilingual PII Detection System in Rust with GLiNER2
I want to share how we built PII Engineer, an open-source, multilingual PII detection system written entirely in Rust. It detects names, phone numbers, government IDs, addresses, and more across 50+ languages, with no GPU required.
This is a deep dive into the architecture, the key crates involved, and lessons learned running ONNX transformer models in Rust at production latency (~180ms p50).
Architecture Overview
HTTP Request (Axum)
↓
Language Detection (CJK check)
↓
┌─────────────────────────────────────┐
│ GLiNER2 Inference (ONNX Runtime) │
│ 5 models: encoder → span_rep → │
│ count_pred → count_embed → │
│ classifier │
│ │
│ + Chinese NER (separate model) │
└─────────────────────────────────────┘
↓
Post-Processing Pipeline (8 stages)
reclassify → validate → filter →
normalize → email/IP detect →
threshold → dedup → merge
↓
JSON Response (entities + redacted text)
The core idea: we use a fine-tuned GLiNER2 model (based on mDeBERTa-v3-base, 280M params) exported to 5 separate ONNX models, with an INT8 quantized encoder for CPU inference.
Key Crates
| Crate | Purpose |
|---|---|
| `ort` (2.0.0-rc.9) | ONNX Runtime bindings - runs the transformer models |
| `tokenizers` (0.21) | HuggingFace tokenizers - WordPiece tokenization |
| `ndarray` (0.16) | N-dimensional array math for tensor manipulation |
| `axum` (0.7) | HTTP server framework |
| `tokio` (1.x) | Async runtime |
| `mimalloc` | Memory allocator (5-10% faster for inference workloads) |
Step 1: Understanding GLiNER2's Architecture
GLiNER (Generalist and Lightweight model for Named Entity Recognition) is different from traditional NER. Instead of training fixed entity types into the model, you pass entity labels as part of the input, making it zero-shot capable.
GLiNER2 splits inference into 5 stages:
- Encoder (mDeBERTa-v3-base): encodes text + label tokens into hidden states
- Span Representation: computes representations for all possible token spans (up to `max_width` tokens)
- Count Prediction: predicts how many entities of each type exist (used as a gate)
- Count Embedding: converts count predictions into embeddings
- Classifier: scores each span against each label
This decomposition lets us quantize the encoder (the bottleneck) to INT8 while keeping the smaller heads in FP32.
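Because the labels are part of the model input rather than baked into the weights, the PII schema is just data that can vary per request. A minimal sketch of what that looks like from the caller's side, using the `detect` method shown in Step 4 (the exact label strings here are illustrative):

```rust
// Inside a function returning Result: the entity schema is plain data per request.
let labels = vec![
    "person_name".to_string(),
    "phone_number".to_string(),
    "government_id".to_string(),
];
let entities = model.detect("John Doe, NRIC S9012345B", &labels)?;

// A narrower schema for the same text needs no retraining or re-export.
let ids_only = model.detect("John Doe, NRIC S9012345B", &["government_id".to_string()])?;
```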
Step 2: Loading ONNX Models in Rust
Here's how we load the 5-model GLiNER2 pipeline:
use ort::session::{builder::GraphOptimizationLevel, Session};
use std::path::Path;
fn load_session(path: &Path, intra_threads: usize) -> Result<Session> {
Ok(Session::builder()?
.with_optimization_level(GraphOptimizationLevel::Level3)?
.with_intra_threads(intra_threads)?
.with_inter_threads(1)?
.with_intra_op_spinning(true)?
.with_inter_op_spinning(false)?
.with_parallel_execution(false)?
.with_memory_pattern(true)?
.commit_from_file(path)?)
}
struct GlinerModel {
encoder: Session,
span_rep: Session,
count_pred: Session,
count_embed: Session,
classifier: Session,
tokenizer: tokenizers::Tokenizer,
max_width: usize, // max span width (8 tokens)
}
impl GlinerModel {
fn load(model_dir: &Path) -> Result<Self> {
let onnx_dir = model_dir.join("onnx");
let intra = std::env::var("ORT_INTRA_THREADS")
.ok()
.and_then(|v| v.parse().ok())
.unwrap_or(4);
// Prefer INT8 encoder if available
let encoder_path = {
let int8 = onnx_dir.join("encoder_int8.onnx");
if int8.exists() { int8 } else { onnx_dir.join("encoder.onnx") }
};
let mut tokenizer = tokenizers::Tokenizer::from_file(
model_dir.join("tokenizer.json")
)?;
tokenizer.with_truncation(Some(tokenizers::TruncationParams {
max_length: 512,
..Default::default()
}))?;
Ok(Self {
encoder: load_session(&encoder_path, intra)?,
span_rep: load_session(&onnx_dir.join("span_rep.onnx"), intra)?,
count_pred: load_session(&onnx_dir.join("count_pred.onnx"), intra)?,
count_embed: load_session(&onnx_dir.join("count_embed.onnx"), intra)?,
classifier: load_session(&onnx_dir.join("classifier.onnx"), intra)?,
tokenizer,
max_width: 8,
})
}
}
Key insight: `ORT_INTRA_THREADS` should match your vCPU count. ONNX Runtime parallelizes a single inference call across cores, so setting it higher than your CPU count causes contention.
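If the environment variable isn't set, a sensible default is the logical core count reported by the OS. A small sketch of that fallback using `std::thread::available_parallelism` (the final fallback of 4 mirrors the loader above):

```rust
use std::thread;

/// Resolve the intra-op thread count: explicit env var wins,
/// otherwise the number of logical cores, otherwise 4.
fn intra_threads() -> usize {
    std::env::var("ORT_INTRA_THREADS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or_else(|| {
            thread::available_parallelism()
                .map(|n| n.get())
                .unwrap_or(4)
        })
}
```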
Step 3: Tokenization and Input Preparation
GLiNER2 uses a special input format. The text and entity labels are concatenated with separator tokens:
[CLS] label1 [SEP] label2 [SEP] ... [SEP_TEXT] word1 word2 ... [SEP]
use regex::Regex;
struct WordSpan {
text: String,
start: usize, // byte offset in original text
end: usize,
}
fn prepare_input(
tokenizer: &tokenizers::Tokenizer,
text: &str,
labels: &[String],
max_width: usize,
) -> (Vec<i64>, Vec<i64>, Vec<WordSpan>, usize, usize) {
let word_re = Regex::new(r"\w+(?:[-_]\w+)*|\S").unwrap();
// Split text into words with character offsets
let words: Vec<WordSpan> = word_re.find_iter(text)
.map(|m| WordSpan {
text: m.as_str().to_string(),
start: m.start(),
end: m.end(),
})
.collect();
// Build prompt: labels joined by [SEP], then [SEP_TEXT], then words
let label_part = labels.join(" [SEP] ");
let word_part: String = words.iter()
.map(|w| w.text.as_str())
.collect::<Vec<_>>()
.join(" ");
let prompt = format!("{label_part} [SEP_TEXT] {word_part}");
let encoding = tokenizer.encode(prompt, true).unwrap();
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
// Find where text tokens start (after [SEP_TEXT])
let sep_text_id = tokenizer.token_to_id("[SEP_TEXT]").unwrap();
let text_start = input_ids.iter().position(|&id| id == sep_text_id as i64)
.unwrap() + 1;
// Number of entity label classes
let num_labels = labels.len();
(input_ids, attention_mask, words, text_start, num_labels)
}
Step 4: Running the 5-Stage Inference Pipeline
use ndarray::{Array2, Array3};
use ort::value::Value;
fn sigmoid(x: f32) -> f32 {
1.0 / (1.0 + (-x).exp())
}
impl GlinerModel {
fn detect(&self, text: &str, labels: &[String]) -> Result<Vec<Entity>> {
let (input_ids, attention_mask, words, text_start, num_labels) =
prepare_input(&self.tokenizer, text, labels, self.max_width);
let seq_len = input_ids.len();
// Stage 1: Encoder, the heavy computation (~120ms on INT8)
let id_array = Array2::from_shape_vec((1, seq_len), input_ids)?;
let mask_array = Array2::from_shape_vec((1, seq_len), attention_mask)?;
let encoder_out = self.encoder.run(ort::inputs![
"input_ids" => Value::from_array(id_array)?,
"attention_mask" => Value::from_array(mask_array)?,
]?)?;
let hidden_states = encoder_out[0].extract_tensor::<f32>()?;
// Shape: [1, seq_len, hidden_size]
// Stage 2: Span representation
// Extract text token embeddings and compute all spans up to max_width
let span_out = self.span_rep.run(ort::inputs![
"hidden_states" => hidden_states.clone(),
]?)?;
// Stage 3: Count prediction
let count_out = self.count_pred.run(ort::inputs![
"hidden_states" => hidden_states.clone(),
]?)?;
// Stage 4: Count embedding
let count_embed_out = self.count_embed.run(ort::inputs![
"count_logits" => count_out[0].clone(),
]?)?;
// Stage 5: Classifier, scores each span × label pair
let classifier_out = self.classifier.run(ort::inputs![
"span_reps" => span_out[0].clone(),
"label_reps" => count_embed_out[0].clone(),
]?)?;
// Decode logits into entities
let logits = classifier_out[0].extract_tensor::<f32>()?;
decode_spans(text, logits.view(), &words, labels, 0.5)
}
}
Step 5: Span Decoding
The classifier outputs logits for every possible (start, end, label) triple. We apply sigmoid and threshold:
struct RawSpan {
word_start: usize,
word_end: usize,
class_idx: usize,
score: f32,
}
fn decode_spans(
text: &str,
logits: &ndarray::ArrayView3<f32>, // [1, num_spans, num_labels]
words: &[WordSpan],
labels: &[String],
threshold: f32,
) -> Vec<Entity> {
let num_words = words.len();
let max_width = 8;
let num_classes = labels.len();
let mut spans: Vec<RawSpan> = Vec::new();
for s in 0..num_words {
for k in 0..max_width.min(num_words - s) {
for c in 0..num_classes {
let span_idx = s * max_width + k; // index along the span axis
let score = sigmoid(logits[[0, span_idx, c]]); // last axis indexes the label class
if score > threshold {
spans.push(RawSpan {
word_start: s,
word_end: s + k,
class_idx: c,
score,
});
}
}
}
}
// Convert word spans to character spans
spans.iter().map(|sp| {
let start_char = words[sp.word_start].start;
let end_char = words[sp.word_end].end;
Entity {
start: start_char,
end: end_char,
text: text[start_char..end_char].to_string(),
label: labels[sp.class_idx].clone(),
score: sp.score,
}
}).collect()
}
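For completeness, the `Entity` type used throughout isn't defined in the snippets above. A minimal definition consistent with the fields accessed here, with serde derives assumed so it can serialize straight into the JSON response in Step 7:

```rust
use serde::Serialize;

/// A detected PII span; offsets are byte positions into the original text.
#[derive(Debug, Clone, Serialize)]
pub struct Entity {
    pub start: usize,
    pub end: usize,
    pub text: String,
    pub label: String,
    pub score: f32,
}
```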
Step 6: Post-Processing Pipeline
Raw NER output is noisy. Our 8-stage pipeline cleans it:
pub fn run_pipeline(mut entities: Vec<Entity>, text: &str, cfg: &PipelineConfig) -> Vec<Entity> {
// 1. Reclassify: fix mislabeled Chinese phone numbers
entities = reclassify(entities);
// 2. Validate: check format (e.g., phone must have digits, names must have uppercase)
entities = validate_format(entities);
// 3. Filter: remove common false positives (pronouns, generic words)
entities = meta_filter(entities);
// 4. Normalize: trim whitespace, fix boundaries
entities = normalize(entities);
// 5. Detect emails: regex-based (more reliable than NER for emails)
entities.extend(detect_emails(text, &entities));
// 6. Detect IPs: regex-based with validation (no 0.0.0.0, no 255.x.x.x)
entities.extend(detect_ip_addresses(text));
// 7. Threshold: per-label confidence thresholds
entities = threshold(entities, cfg);
// 8. Dedup + merge: remove overlapping spans, merge adjacent same-type entities
entities = dedup(entities);
entities = merge_adjacent(entities, text);
entities
}
Example: meta_filter removes false positives like pronouns detected as names:
use once_cell::sync::Lazy;
use std::collections::HashSet;

static META_WORDS: Lazy<HashSet<&str>> = Lazy::new(|| {
HashSet::from(["i", "you", "he", "she", "it", "we", "they",
"mom", "dad", "husband", "wife", "mr", "mrs", ...])
});
fn meta_filter(entities: Vec<Entity>) -> Vec<Entity> {
entities.into_iter().filter(|e| {
if e.label == "person_name" {
!META_WORDS.contains(e.text.to_lowercase().trim())
} else {
true
}
}).collect()
}
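The email and IP stages work the same way. Here is a minimal sketch of regex-based email detection; the pattern, the "email" label string, and the overlap check are illustrative rather than the exact production rules, but they show why a regex is more dependable than NER for this entity type:

```rust
use once_cell::sync::Lazy;
use regex::Regex;

static EMAIL_RE: Lazy<Regex> = Lazy::new(|| {
    // Deliberately simple pattern; stricter validation can happen in later stages.
    Regex::new(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}").unwrap()
});

fn detect_emails(text: &str, existing: &[Entity]) -> Vec<Entity> {
    EMAIL_RE
        .find_iter(text)
        // Skip matches that overlap something the model already found.
        .filter(|m| !existing.iter().any(|e| m.start() < e.end && e.start < m.end()))
        .map(|m| Entity {
            start: m.start(),
            end: m.end(),
            text: m.as_str().to_string(),
            label: "email".to_string(),
            score: 1.0,
        })
        .collect()
}
```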
Step 7: The HTTP Server (Axum)
use axum::{extract::State, routing::post, Json, Router};
use std::path::Path;
use std::sync::Arc;
// Use mimalloc for better multi-threaded allocation performance
#[cfg(unix)]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

#[tokio::main]
async fn main() -> Result<()> {
let model = Arc::new(GlinerModel::load(Path::new("models/PII-Engineer-Multi-NER-v2.1"))?);
// Warm up the model (first inference is slow due to memory allocation)
model.detect("John Doe lives at 123 Main St", &default_labels())?;
let app = Router::new()
.route("/api/detect", post(detect))
.with_state(model);
let listener = tokio::net::TcpListener::bind("0.0.0.0:8000").await?;
axum::serve(listener, app).await?;
Ok(())
}
async fn detect(
State(model): State<Arc<GlinerModel>>,
Json(req): Json<DetectRequest>,
) -> Json<DetectResponse> {
// Run inference on the blocking thread pool (CPU-bound work);
// error handling is simplified here to keep the example short.
let text = req.text.clone();
let labels = req.labels.clone();
let entities = tokio::task::spawn_blocking(move || model.detect(&text, &labels))
.await
.expect("inference task panicked")
.expect("inference failed");
let redacted = build_redacted(&req.text, &entities);
Json(DetectResponse { entities, redacted })
}
Critical: use `spawn_blocking` for inference. ONNX Runtime does heavy CPU-bound work that would otherwise block the Tokio event loop.
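The request and response types aren't shown above. A minimal pair consistent with the handler and with the curl example in the Try It section (field names inferred from usage; the `serde(default)` on `labels` is an assumption so requests can omit it and fall back to the built-in PII label set via `default_labels`):

```rust
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct DetectRequest {
    text: String,
    /// Entity labels to detect; defaults to the built-in PII label set.
    #[serde(default = "default_labels")]
    labels: Vec<String>,
}

#[derive(Serialize)]
struct DetectResponse {
    entities: Vec<Entity>,
    redacted: String,
}
```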
Step 8: Performance Optimizations
INT8 Quantization
The encoder (280M params) is the bottleneck. Quantizing to INT8 cuts inference time by ~40%:
# Export with ONNX opset 14 (important: opset 17 had accuracy issues)
python export_onnx.py --opset 14 --no-constant-folding
# Quantize encoder only
python -m onnxruntime.quantization.quantize \
--input encoder.onnx --output encoder_int8.onnx \
--per_channel --reduce_range
Memory Locking (Linux)
Prevent the OS from swapping model weights to disk:
#[cfg(unix)]
fn lock_memory() {
unsafe {
if libc::mlockall(libc::MCL_CURRENT) == 0 {
tracing::info!("model weights locked in RAM");
}
}
}
Periodic Warmup
ONNX models get evicted from CPU cache if idle. We run inference every 60s:
use std::time::Duration;

// `model` is the Arc<GlinerModel> shared with the Axum router.
let model = model.clone();
tokio::spawn(async move {
let mut interval = tokio::time::interval(Duration::from_secs(60));
loop {
interval.tick().await;
let model = model.clone();
tokio::task::spawn_blocking(move || model.warm_up()).await.ok();
}
});
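The `warm_up` method isn't shown elsewhere; a minimal version just runs one throwaway detection over a short fixed string, mirroring the startup warm-up in Step 7 (the sample text and `default_labels` helper are the same assumptions used there):

```rust
impl GlinerModel {
    /// Run one cheap inference to keep weights and caches hot.
    fn warm_up(&self) {
        let _ = self.detect("John Doe lives at 123 Main St", &default_labels());
    }
}
```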
Release Profile
[profile.release]
lto = "fat" # link-time optimization across all crates
codegen-units = 1 # single codegen unit for maximum optimization
opt-level = 3
strip = "symbols" # smaller binary
Step 9: ONNX Runtime Dynamic Loading
We use the `load-dynamic` feature of `ort`: the ONNX Runtime shared library (`.so`/`.dylib`) is loaded at runtime rather than linked at compile time. This means:
- Binary works across different ONNX Runtime versions
- Users can swap in GPU-enabled builds without recompiling
- Smaller binary size
// Auto-detect libonnxruntime location at startup
if std::env::var("ORT_DYLIB_PATH").is_err() {
let candidates = [
"lib/libonnxruntime.dylib",
"lib/libonnxruntime.so",
"/usr/local/lib/libonnxruntime.so",
];
for path in candidates {
if Path::new(path).exists() {
std::env::set_var("ORT_DYLIB_PATH", path);
break;
}
}
}
Step 10: Model Fine-Tuning (LoRA)
We fine-tuned gliner2-multi-v1 (the multilingual variant, not the English-only base) using LoRA on specific layers:
- Encoder attention (Q, K, V projections): captures entity boundary patterns
- Span representation layer: learns PII-specific span features
- Classifier head: learns PII label semantics
Important: never apply LoRA across the full encoder (the dense/FFN layers as well as attention). We tried this and it destabilized the pretrained representations.
Training data: ~600 manually crafted samples across 12 countries and 8 industries, covering all 9 PII types in realistic multilingual contexts.
Results
| Metric | Score |
|---|---|
| F1 (multilingual, 13 languages) | 0.86 |
| F1 (English) | 0.88 |
| Latency (4-vCPU AMD, INT8) | ~250ms p50 |
| Latency (MacBook M-series, FP32) | ~150ms p50 |
| Memory usage | ~800MB |
| Binary size | ~15MB (+ ONNX Runtime ~60MB) |
Lessons Learned
- The `ort` crate is production-ready. The 2.0 RC works well. Dynamic loading is the way to go for deployment flexibility.
- Tokenizer alignment is tricky. GLiNER uses word-level spans but the tokenizer produces subword tokens. You need careful mapping between word indices and character offsets.
- Post-processing matters more than model accuracy. Our 8-stage pipeline improved effective F1 by ~15 points over raw model output. Simple regex for emails/IPs beats NER every time.
- INT8 quantization is free performance. On the encoder specifically, we saw <0.5% accuracy loss with a 40% speed improvement.
- `spawn_blocking` is essential. A single ONNX inference call takes 150-250ms of solid CPU work. Without `spawn_blocking`, one request blocks all other connections.
- Warm up aggressively. First inference after idle is 2-3x slower due to cache misses. Periodic warmup keeps latency consistent.
- Match `ORT_INTRA_THREADS` to the vCPU count. Setting it higher causes thread contention; setting it lower leaves cores idle during inference.
Try It
git clone https://github.com/gantz-ai/pii.engineer.git
cd pii.engineer
cargo build --release --package pii-engineer-server
cargo run --release --package pii-engineer-server
# Models auto-download from HuggingFace on first run
curl -X POST http://localhost:8000/api/detect \
-H "Content-Type: application/json" \
-d '{"text": "John Doe, NRIC S9012345B, born 12 March 1985"}'
Source: https://github.com/gantz-ai/pii.engineer
Models: https://huggingface.co/pii-engineer
Happy to answer questions about ONNX Runtime in Rust, GLiNER2 architecture, or multilingual NER in general!