3DCF/doc2dataset – Rust-based document → dataset pipeline (30+ formats, token compression, numeric integrity)

⚓ Rust    📅 2025-12-07    👤 surdeus


Hey all 👋

I’ve been working on 3DCF/doc2dataset, a Rust-based pipeline that turns real-world documents (PDF/HTML/JSON/CSV/LaTeX, etc.) into LLM-ready datasets for RAG and fine-tuning.

The original pain: every time I wanted to build a RAG system or fine-tune a model, I had a bunch of documents but no reproducible, efficient way to turn them into training data. Lots of ad-hoc Python scripts, bloated contexts, and numbers getting silently corrupted in financial/legal docs.

So I built the core in Rust and wrapped it with a CLI and language bindings.


What it does (high level)

  • Ingests 30+ formats
    PDF, Markdown, plain text, HTML, XML, JSON, YAML, TOML, CSV/TSV, LaTeX, BibTeX, images via OCR, etc.
  • Token-efficient “macro-cells”
    Instead of dumping raw text, it builds layout-aware chunks with importance scores, typically giving 5–6× token compression while preserving structure (rough sketch after the list).
  • NumGuard: numeric integrity
    Extracts every number, computes a hash, and tracks it through the pipeline so you can detect numeric corruption in financial/legal documents (concept sketch below).
  • Multi-framework export
    Process once, then export to HuggingFace, LLaMA-Factory, Axolotl, OpenAI fine-tuning JSONL, and RAG triples (example record below).
  • Bindings & CLI
    Rust core with a CLI plus Python / Node.js bindings for teams that don’t use Rust directly.
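
To make the macro-cell idea a bit more concrete, this is roughly the shape of chunk I’m talking about. The struct and field names here are just an illustration for this post, not the crate’s actual types:

```rust
// Illustrative only: these names are made up for this post, not the crate's real types.
// A "macro-cell" is roughly a layout-aware chunk plus the metadata needed to rank it.
struct MacroCell {
    /// Where the chunk came from, e.g. ["3 Results", "3.2 Revenue"].
    section_path: Vec<String>,
    /// What kind of layout element it was.
    kind: CellKind,
    /// Compressed, structure-preserving text.
    text: String,
    /// Importance score used to decide what survives token compression.
    importance: f32,
}

#[derive(Debug)]
#[allow(dead_code)]
enum CellKind {
    Paragraph,
    Table,
    List,
    Code,
}

fn main() {
    let cell = MacroCell {
        section_path: vec!["3 Results".into(), "3.2 Revenue".into()],
        kind: CellKind::Table,
        text: "Q3 net revenue | 1,204.7 | +4.2% YoY".into(),
        importance: 0.92,
    };
    println!("{:?} cell, importance {}", cell.kind, cell.importance);
    println!("under {}", cell.section_path.join(" > "));
    println!("{}", cell.text);
}
```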
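
NumGuard, stripped down to a sketch (std-only, and much simpler than the real thing), is basically: extract the numbers, fingerprint them, and compare fingerprints before and after processing:

```rust
// Simplified sketch of the NumGuard idea: pull every number out of the text,
// fingerprint the ordered list, and re-check the fingerprint after the pipeline runs.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Extract numeric tokens, normalizing thousands separators.
fn extract_numbers(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_ascii_digit() || c == '.' || c == ','))
        .filter(|tok| tok.chars().any(|c| c.is_ascii_digit()))
        .map(|tok| tok.replace(',', ""))
        .map(|tok| tok.trim_matches('.').to_string())
        .collect()
}

/// Fingerprint of all numbers in a document, in order.
fn numeric_fingerprint(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    extract_numbers(text).hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let source    = "Net revenue was 1,204.7 in Q3, up 4.2% year over year.";
    let processed = "Net revenue was 1204.7 in Q3, up 4.2% year over year.";
    let corrupted = "Net revenue was 1204.1 in Q3, up 4.2% year over year.";

    // Reformatting that keeps the numbers intact passes the check...
    assert_eq!(numeric_fingerprint(source), numeric_fingerprint(processed));
    // ...while a silently mangled digit is caught.
    assert_ne!(numeric_fingerprint(source), numeric_fingerprint(corrupted));
    println!("numeric integrity check passed");
}
```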
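
On the export side, the OpenAI fine-tuning target is plain JSON Lines where each line carries a chat-style `messages` array. A made-up record built with `serde_json`, just to show the shape (not the crate’s actual export code):

```rust
// Shape of one OpenAI fine-tuning JSONL record (chat format). The content is made up;
// this just shows the target format. Needs the `serde_json` crate.
use serde_json::json;

fn main() {
    let record = json!({
        "messages": [
            { "role": "system",    "content": "You answer questions about the 2024 annual report." },
            { "role": "user",      "content": "What was net revenue in Q3?" },
            { "role": "assistant", "content": "Net revenue in Q3 was 1,204.7 million, up 4.2% year over year." }
        ]
    });

    // One record per line of the .jsonl file.
    println!("{record}");
}
```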

Repo (Apache-2.0): https://github.com/3DCF-Labs/doc2dataset (token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning).

Feedback and contributions appreciated!
