Building a Production Multilingual PII Detection System in Rust with GLiNER2

⚓ Rust    📅 2026-05-05    👤 surdeus    👁️ 1      


I want to share how we built PII Engineer, an open-source, multilingual PII detection system written entirely in Rust. It detects names, phone numbers, government IDs, addresses, and more across 50+ languages, with no GPU required.

This is a deep dive into the architecture, the key crates involved, and lessons learned running ONNX transformer models in Rust at production latency (~150-250ms p50, depending on hardware).


Architecture Overview

HTTP Request (Axum)
    ↓
Language Detection (CJK check)
    ↓
┌─────────────────────────────────────┐
│  GLiNER2 Inference (ONNX Runtime)   │
│  5 models: encoder → span_rep →     │
│  count_pred → count_embed →         │
│  classifier                         │
│                                     │
│  + Chinese NER (separate model)     │
└─────────────────────────────────────┘
    ↓
Post-Processing Pipeline (8 stages)
    reclassify → validate → filter →
    normalize → email/IP detect →
    threshold → dedup → merge
    ↓
JSON Response (entities + redacted text)

The core idea: we use a fine-tuned GLiNER2 model (based on mDeBERTa-v3-base, 280M params) exported to 5 separate ONNX models, with an INT8 quantized encoder for CPU inference.


Key Crates

  • ort (2.0.0-rc.9): ONNX Runtime bindings; runs the transformer models
  • tokenizers (0.21): HuggingFace tokenizers; WordPiece tokenization
  • ndarray (0.16): N-dimensional array math for tensor manipulation
  • axum (0.7): HTTP server framework
  • tokio (1.x): async runtime
  • mimalloc: memory allocator (5-10% faster for inference workloads)

Step 1: Understanding GLiNER2's Architecture

GLiNER (Generalist and Lightweight model for Named Entity Recognition) is different from traditional NER. Instead of training fixed entity types into the model, you pass entity labels as part of the input, making it zero-shot capable.

GLiNER2 splits inference into 5 stages:

  1. Encoder (mDeBERTa-v3-base): encodes text + label tokens into hidden states
  2. Span Representation: computes representations for all possible token spans (up to max_width tokens)
  3. Count Prediction: predicts how many entities of each type exist (used as a gate)
  4. Count Embedding: converts count predictions into embeddings
  5. Classifier: scores each span against each label

This decomposition lets us quantize the encoder (the bottleneck) to INT8 while keeping the smaller heads in FP32.


Step 2: Loading ONNX Models in Rust

Here's how we load the 5-model GLiNER2 pipeline:

use ort::session::{builder::GraphOptimizationLevel, Session};
use std::path::Path;
// (`Result` here is assumed to be anyhow::Result or a similar crate alias)

fn load_session(path: &Path, intra_threads: usize) -> Result<Session> {
    Ok(Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(intra_threads)?
        .with_inter_threads(1)?
        .with_intra_op_spinning(true)?
        .with_inter_op_spinning(false)?
        .with_parallel_execution(false)?
        .with_memory_pattern(true)?
        .commit_from_file(path)?)
}

struct GlinerModel {
    encoder: Session,
    span_rep: Session,
    count_pred: Session,
    count_embed: Session,
    classifier: Session,
    tokenizer: tokenizers::Tokenizer,
    max_width: usize,  // max span width (8 tokens)
}

impl GlinerModel {
    fn load(model_dir: &Path) -> Result<Self> {
        let onnx_dir = model_dir.join("onnx");
        let intra = std::env::var("ORT_INTRA_THREADS")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(4);

        // Prefer INT8 encoder if available
        let encoder_path = {
            let int8 = onnx_dir.join("encoder_int8.onnx");
            if int8.exists() { int8 } else { onnx_dir.join("encoder.onnx") }
        };

        let mut tokenizer = tokenizers::Tokenizer::from_file(
            model_dir.join("tokenizer.json")
        )?;
        tokenizer.with_truncation(Some(tokenizers::TruncationParams {
            max_length: 512,
            ..Default::default()
        }))?;

        Ok(Self {
            encoder: load_session(&encoder_path, intra)?,
            span_rep: load_session(&onnx_dir.join("span_rep.onnx"), intra)?,
            count_pred: load_session(&onnx_dir.join("count_pred.onnx"), intra)?,
            count_embed: load_session(&onnx_dir.join("count_embed.onnx"), intra)?,
            classifier: load_session(&onnx_dir.join("classifier.onnx"), intra)?,
            tokenizer,
            max_width: 8,
        })
    }
}

Key insight: ORT_INTRA_THREADS should match your vCPU count. ONNX Runtime parallelizes within a single inference call across cores. Setting it higher than your CPU count causes contention.
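Rather than hard-coding the fallback of 4, `std::thread::available_parallelism` can supply the default. A small sketch (not from the original codebase):

```rust
use std::num::NonZeroUsize;

/// Pick an intra-op thread count: honor ORT_INTRA_THREADS if set,
/// otherwise fall back to the number of logical CPUs.
fn default_intra_threads() -> usize {
    std::env::var("ORT_INTRA_THREADS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or_else(|| {
            std::thread::available_parallelism()
                .map(NonZeroUsize::get)
                .unwrap_or(4) // conservative fallback for exotic platforms
        })
}
```

This keeps the env var as an override while matching the vCPU count by default, which is exactly the tuning rule above.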


Step 3: Tokenization and Input Preparation

GLiNER2 uses a special input format. The text and entity labels are concatenated with separator tokens:

[CLS] label1 [SEP] label2 [SEP] ... [SEP_TEXT] word1 word2 ... [SEP]

use regex::Regex;

struct WordSpan {
    text: String,
    start: usize,  // char offset in original text
    end: usize,
}

fn prepare_input(
    tokenizer: &tokenizers::Tokenizer,
    text: &str,
    labels: &[String],
    max_width: usize,
) -> (Vec<i64>, Vec<i64>, Vec<WordSpan>, usize, usize) {
    let word_re = Regex::new(r"\w+(?:[-_]\w+)*|\S").unwrap();

    // Split text into words with character offsets
    let words: Vec<WordSpan> = word_re.find_iter(text)
        .map(|m| WordSpan {
            text: m.as_str().to_string(),
            start: m.start(),
            end: m.end(),
        })
        .collect();

    // Build prompt: labels joined by [SEP], then [SEP_TEXT], then words
    let label_part = labels.join(" [SEP] ");
    let word_part: String = words.iter()
        .map(|w| w.text.as_str())
        .collect::<Vec<_>>()
        .join(" ");
    let prompt = format!("{label_part} [SEP_TEXT] {word_part}");

    let encoding = tokenizer.encode(prompt, true).unwrap();
    let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
    let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();

    // Find where text tokens start (after [SEP_TEXT])
    let sep_text_id = tokenizer.token_to_id("[SEP_TEXT]").unwrap();
    let text_start = input_ids.iter().position(|&id| id == sep_text_id as i64)
        .unwrap() + 1;

    // Number of entity labels (one [SEP]-separated segment each)
    let num_labels = labels.len();

    (input_ids, attention_mask, words, text_start, num_labels)
}
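The subtle part of `prepare_input` is the offset bookkeeping: every downstream span must map back into the original string. A dependency-free sketch of the same idea using plain whitespace splitting (the article's regex additionally treats punctuation as separate words):

```rust
#[derive(Debug, PartialEq)]
struct WordSpan {
    text: String,
    start: usize, // byte offset into the original text
    end: usize,
}

/// Split on whitespace while remembering where each word came from,
/// so detected spans can be mapped back to the source string.
fn split_words(text: &str) -> Vec<WordSpan> {
    let mut words = Vec::new();
    let mut start = None;
    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            if let Some(s) = start.take() {
                words.push(WordSpan { text: text[s..i].to_string(), start: s, end: i });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        words.push(WordSpan { text: text[s..].to_string(), start: s, end: text.len() });
    }
    words
}
```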

Step 4: Running the 5-Stage Inference Pipeline

use ndarray::{Array2, Array3};
use ort::value::Value;

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

impl GlinerModel {
    fn detect(&self, text: &str, labels: &[String]) -> Result<Vec<Entity>> {
        let (input_ids, attention_mask, words, text_start, num_labels) =
            prepare_input(&self.tokenizer, text, labels, self.max_width);

        let seq_len = input_ids.len();

        // Stage 1: Encoder, the heavy computation (~120ms on INT8)
        let id_array = Array2::from_shape_vec((1, seq_len), input_ids)?;
        let mask_array = Array2::from_shape_vec((1, seq_len), attention_mask)?;

        let encoder_out = self.encoder.run(ort::inputs![
            "input_ids" => Value::from_array(id_array)?,
            "attention_mask" => Value::from_array(mask_array)?,
        ]?)?;

        let hidden_states = encoder_out[0].extract_tensor::<f32>()?;
        // Shape: [1, seq_len, hidden_size]

        // Stage 2: Span representation
        // Extract text token embeddings and compute all spans up to max_width
        let span_out = self.span_rep.run(ort::inputs![
            "hidden_states" => hidden_states.clone(),
        ]?)?;

        // Stage 3: Count prediction
        let count_out = self.count_pred.run(ort::inputs![
            "hidden_states" => hidden_states.clone(),
        ]?)?;

        // Stage 4: Count embedding
        let count_embed_out = self.count_embed.run(ort::inputs![
            "count_logits" => count_out[0].clone(),
        ]?)?;

        // Stage 5: Classifier, scores each span × label pair
        let classifier_out = self.classifier.run(ort::inputs![
            "span_reps" => span_out[0].clone(),
            "label_reps" => count_embed_out[0].clone(),
        ]?)?;

        // Decode logits into entities
        let logits = classifier_out[0].extract_tensor::<f32>()?;
        self.decode_spans(logits.view(), text, &words, labels, 0.5)
    }
}

Step 5: Span Decoding

The classifier outputs logits for every possible (start, end, label) triple. We apply sigmoid and threshold:

struct RawSpan {
    word_start: usize,
    word_end: usize,
    class_idx: usize,
    score: f32,
}

fn decode_spans(
    logits: &ndarray::ArrayView3<f32>,  // [1, num_spans, num_labels]
    text: &str,
    words: &[WordSpan],
    labels: &[String],
    threshold: f32,
) -> Vec<Entity> {
    let num_words = words.len();
    let max_width = 8;
    let num_classes = labels.len();
    let mut spans: Vec<RawSpan> = Vec::new();

    for s in 0..num_words {
        for k in 0..max_width.min(num_words - s) {
            for c in 0..num_classes {
                let span_idx = s * max_width + k;
                let score = sigmoid(logits[[0, span_idx, c]]);
                if score > threshold {
                    spans.push(RawSpan {
                        word_start: s,
                        word_end: s + k,
                        class_idx: c,
                        score,
                    });
                }
            }
        }
    }

    // Convert word spans to character spans
    spans.iter().map(|sp| {
        let start_char = words[sp.word_start].start;
        let end_char = words[sp.word_end].end;
        Entity {
            start: start_char,
            end: end_char,
            text: text[start_char..end_char].to_string(),
            label: labels[sp.class_idx].clone(),
            score: sp.score,
        }
    }).collect()
}

Step 6: Post-Processing Pipeline

Raw NER output is noisy. Our 8-stage pipeline cleans it:

pub fn run_pipeline(mut entities: Vec<Entity>, text: &str, cfg: &PipelineConfig) -> Vec<Entity> {
    // 1. Reclassify: fix mislabeled Chinese phone numbers
    entities = reclassify(entities);
    // 2. Validate: check format (e.g., phone must have digits, names must have uppercase)
    entities = validate_format(entities);
    // 3. Filter: remove common false positives (pronouns, generic words)
    entities = meta_filter(entities);
    // 4. Normalize: trim whitespace, fix boundaries
    entities = normalize(entities);
    // 5. Detect emails: regex-based (more reliable than NER for emails)
    entities.extend(detect_emails(text, &entities));
    // 6. Detect IPs: regex-based with validation (no 0.0.0.0, no 255.x.x.x)
    entities.extend(detect_ip_addresses(text));
    // 7. Threshold: per-label confidence thresholds
    entities = threshold(entities, cfg);
    // 8. Dedup + merge: remove overlapping spans, merge adjacent same-type entities
    entities = dedup(entities);
    entities = merge_adjacent(entities, text);
    entities
}
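The dedup stage itself isn't shown above; a common way to resolve overlapping spans, and a plausible sketch of it, is greedy selection by score (the helper and the `Entity` shape here are illustrative, not the repo's exact code):

```rust
#[derive(Debug, Clone)]
struct Entity {
    start: usize,
    end: usize,
    label: String,
    score: f32,
}

/// Greedy overlap resolution: keep the highest-scoring span,
/// drop anything that overlaps an already-kept span.
fn dedup(mut entities: Vec<Entity>) -> Vec<Entity> {
    entities.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    let mut kept: Vec<Entity> = Vec::new();
    for e in entities {
        let overlaps = kept.iter().any(|k| e.start < k.end && k.start < e.end);
        if !overlaps {
            kept.push(e);
        }
    }
    // Restore reading order for the response
    kept.sort_by_key(|e| e.start);
    kept
}
```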

Example: meta_filter removes false positives like pronouns detected as names:

use std::collections::HashSet;
use once_cell::sync::Lazy;

static META_WORDS: Lazy<HashSet<&str>> = Lazy::new(|| {
    HashSet::from(["i", "you", "he", "she", "it", "we", "they",
                   "mom", "dad", "husband", "wife", "mr", "mrs", ...])
});

fn meta_filter(entities: Vec<Entity>) -> Vec<Entity> {
    entities.into_iter().filter(|e| {
        if e.label == "person_name" {
            !META_WORDS.contains(e.text.to_lowercase().trim())
        } else {
            true
        }
    }).collect()
}
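The IP stage from step 6 can be sketched similarly. Here is a std-only version of the validation half under the same assumptions as the pipeline comment (the real detect_ip_addresses also scans the text with a regex; this helper is illustrative):

```rust
/// Validate an IPv4 candidate: four in-range octets, rejecting
/// the unspecified address and broadcast-style 255.x.x.x addresses
/// that are unlikely to be PII.
fn is_valid_ipv4(candidate: &str) -> bool {
    let octets: Vec<&str> = candidate.split('.').collect();
    if octets.len() != 4 {
        return false;
    }
    let mut parsed = [0u8; 4];
    for (i, o) in octets.iter().enumerate() {
        match o.parse::<u8>() {
            // `o.len() <= 3` rejects forms like "0255" that still parse as u8
            Ok(v) if o.len() <= 3 => parsed[i] = v,
            _ => return false,
        }
    }
    if candidate == "0.0.0.0" || parsed[0] == 255 {
        return false;
    }
    true
}
```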

Step 7: The HTTP Server (Axum)

use axum::{extract::State, http::StatusCode, routing::post, Json, Router};
use std::sync::Arc;

// mimalloc: 5-10% faster for multi-threaded inference workloads.
// The global allocator must be declared at module scope.
#[cfg(unix)]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

#[tokio::main]
async fn main() -> Result<()> {
    let model = Arc::new(GlinerModel::load(Path::new("models/PII-Engineer-Multi-NER-v2.1"))?);

    // Warm up the model (first inference is slow due to memory allocation)
    model.detect("John Doe lives at 123 Main St", &default_labels())?;

    let app = Router::new()
        .route("/api/detect", post(detect))
        .with_state(model);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}

async fn detect(
    State(model): State<Arc<GlinerModel>>,
    Json(req): Json<DetectRequest>,
) -> Result<Json<DetectResponse>, StatusCode> {
    // Keep a copy of the text: `req` moves into the closure below
    let text = req.text.clone();

    // Run inference on the blocking thread pool (CPU-bound work)
    let entities = tokio::task::spawn_blocking(move || {
        model.detect(&req.text, &req.labels)
    })
    .await
    .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?
    .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;

    let redacted = build_redacted(&text, &entities);
    Ok(Json(DetectResponse { entities, redacted }))
}

Critical: Use spawn_blocking for inference. ONNX Runtime does heavy CPU work that would block the Tokio event loop.
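build_redacted is referenced but not shown. A plausible implementation, assuming byte offsets and non-overlapping entities after dedup, replaces spans from right to left so earlier offsets stay valid:

```rust
struct Entity {
    start: usize, // byte offsets into the original text
    end: usize,
    label: String,
}

/// Replace each detected span with a [LABEL] placeholder.
/// Processing spans right-to-left keeps earlier byte offsets valid.
fn build_redacted(text: &str, entities: &[Entity]) -> String {
    let mut sorted: Vec<&Entity> = entities.iter().collect();
    sorted.sort_by_key(|e| std::cmp::Reverse(e.start));
    let mut out = text.to_string();
    for e in sorted {
        out.replace_range(e.start..e.end, &format!("[{}]", e.label.to_uppercase()));
    }
    out
}
```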


Step 8: Performance Optimizations

INT8 Quantization

The encoder (280M params) is the bottleneck. Quantizing to INT8 cuts inference time by ~40%:

# Export with ONNX opset 14 (important: opset 17 had accuracy issues)
python export_onnx.py --opset 14 --no-constant-folding
# Quantize encoder only
python -m onnxruntime.quantization.quantize \
    --input encoder.onnx --output encoder_int8.onnx \
    --per_channel --reduce_range

Memory Locking (Linux)

Prevent the OS from swapping model weights to disk:

#[cfg(unix)]
fn lock_memory() {
    unsafe {
        if libc::mlockall(libc::MCL_CURRENT) == 0 {
            tracing::info!("model weights locked in RAM");
        }
    }
}

Periodic Warmup

Inference slows down after idle periods (cold CPU caches, frequency scaling, reclaimed pages). We run a dummy inference every 60s:

tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(60));
    loop {
        interval.tick().await;
        let model = model.clone();
        tokio::task::spawn_blocking(move || model.warm_up()).await.ok();
    }
});

Release Profile

[profile.release]
lto = "fat"        # link-time optimization across all crates
codegen-units = 1  # single codegen unit for maximum optimization
opt-level = 3
strip = "symbols"  # smaller binary

Step 9: ONNX Runtime Dynamic Loading

We use ort's load-dynamic feature: the ONNX Runtime shared library (.so/.dylib) is loaded at runtime instead of being linked at compile time. This means:

  • Binary works across different ONNX Runtime versions
  • Users can swap in GPU-enabled builds without recompiling
  • Smaller binary size

// Auto-detect libonnxruntime location at startup
if std::env::var("ORT_DYLIB_PATH").is_err() {
    let candidates = [
        "lib/libonnxruntime.dylib",
        "lib/libonnxruntime.so",
        "/usr/local/lib/libonnxruntime.so",
    ];
    for path in candidates {
        if Path::new(path).exists() {
            std::env::set_var("ORT_DYLIB_PATH", path);
            break;
        }
    }
}

Step 10: Model Fine-Tuning (LoRA)

We fine-tuned gliner2-multi-v1 (the multilingual variant, not the English-only base) using LoRA on specific layers:

  • Encoder attention (Q, K, V projections): captures entity boundary patterns
  • Span representation layer: learns PII-specific span features
  • Classifier head: learns PII label semantics

Important: Never LoRA the full encoder (dense + FFN layers). We tried this and it destabilized the pretrained representations.

Training data: ~600 manually crafted samples across 12 countries and 8 industries, covering all 9 PII types in realistic multilingual contexts.


Results

  • F1 (multilingual, 13 languages): 0.86
  • F1 (English): 0.88
  • Latency (4-vCPU AMD, INT8): ~250ms p50
  • Latency (MacBook M-series, FP32): ~150ms p50
  • Memory usage: ~800MB
  • Binary size: ~15MB (+ ONNX Runtime ~60MB)

Lessons Learned

  1. ort crate is production-ready. The 2.0 RC works well. Dynamic loading is the way to go for deployment flexibility.

  2. Tokenizer alignment is tricky. GLiNER uses word-level spans but the tokenizer produces subword tokens. You need careful mapping between word indices and character offsets.

  3. Post-processing matters more than model accuracy. Our 8-stage pipeline improved effective F1 by ~15 points over raw model output. Simple regex for emails/IPs beats NER every time.

  4. INT8 quantization is free performance. On the encoder specifically, we saw <0.5% accuracy loss with 40% speed improvement.

  5. spawn_blocking is essential. A single ONNX inference call takes 150-250ms of solid CPU work. Without spawn_blocking, one request blocks all other connections.

  6. Warm up aggressively. First inference after idle is 2-3x slower due to cache misses. Periodic warmup keeps latency consistent.

  7. Match ORT_INTRA_THREADS to vCPU count. Setting it higher causes thread contention. Setting it lower leaves cores idle during inference.


Try It

git clone https://github.com/gantz-ai/pii.engineer.git
cd pii.engineer
cargo build --release --package pii-engineer-server
cargo run --release --package pii-engineer-server
# Models auto-download from HuggingFace on first run
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "John Doe, NRIC S9012345B, born 12 March 1985"}'

Source: https://github.com/gantz-ai/pii.engineer
Models: https://huggingface.co/pii-engineer


Happy to answer questions about ONNX Runtime in Rust, GLiNER2 architecture, or multilingual NER in general!
