3DCF/doc2dataset – Rust-based document → dataset pipeline (30+ formats, token compression, numeric integrity)
Hey all!
I’ve been working on 3DCF/doc2dataset, a Rust-based pipeline that turns real-world documents (PDF/HTML/JSON/CSV/LaTeX, etc.) into LLM-ready datasets for RAG and fine-tuning.
The original pain: every time I wanted to build a RAG system or fine-tune a model, I had a bunch of documents but no reproducible, efficient way to turn them into training data. Lots of ad-hoc Python scripts, bloated contexts, and numbers getting silently corrupted in financial/legal docs.
So I built a core in Rust and wrapped it.
What it does (high level)
- Ingests 30+ formats: PDF, Markdown, plain text, HTML, XML, JSON, YAML, TOML, CSV/TSV, LaTeX, BibTeX, images via OCR, etc.
- Token-efficient “macro-cells”: instead of dumping raw text, it builds layout-aware chunks with importance scores, typically giving 5–6× token compression while preserving structure (rough sketch after this list).
- NumGuard (numeric integrity): extracts every number, computes a hash, and tracks it through the pipeline so you can detect numeric corruption in financial/legal documents (rough sketch after this list).
- Multi-framework export: process once, then export to HuggingFace, LLaMA-Factory, Axolotl, OpenAI fine-tuning JSONL, and RAG triples (rough sketch after this list).
- Bindings & CLI: Rust core with a CLI plus Python / Node.js bindings for teams that don’t use Rust directly.
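To make the “macro-cell” idea concrete, here is a minimal Rust sketch of layout-aware chunks that carry an importance score and get packed into a token budget. The `MacroCell` struct, `pack_context` function, and the scores/token counts are illustrative names and numbers I made up for this post, not doc2dataset’s actual API.

```rust
// Sketch only: layout-aware chunks with importance scores, packed greedily
// under a token budget. Names and scoring are placeholders.

#[derive(Debug, Clone)]
struct MacroCell {
    text: String,     // chunk content (heading + body, table, list, ...)
    tokens: usize,    // estimated token count for this chunk
    importance: f32,  // layout/position-derived score in [0, 1]
}

/// Greedily keep the highest-importance cells that still fit in `budget` tokens.
fn pack_context(mut cells: Vec<MacroCell>, budget: usize) -> Vec<MacroCell> {
    cells.sort_by(|a, b| b.importance.total_cmp(&a.importance));
    let mut used = 0;
    let mut picked = Vec::new();
    for cell in cells {
        if used + cell.tokens <= budget {
            used += cell.tokens;
            picked.push(cell);
        }
    }
    picked
}

fn main() {
    let cells = vec![
        MacroCell { text: "## Revenue table".into(), tokens: 120, importance: 0.9 },
        MacroCell { text: "Boilerplate footer".into(), tokens: 40, importance: 0.1 },
        MacroCell { text: "Executive summary".into(), tokens: 200, importance: 0.8 },
    ];
    for c in pack_context(cells, 256) {
        println!("kept ({} tokens, score {}): {}", c.tokens, c.importance, c.text);
    }
}
```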
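And a rough, std-only sketch of the NumGuard idea: pull numeric tokens out of a span, fingerprint them, and check the fingerprint again after processing. The post above describes tracking every number through the pipeline; the extraction rules and the `DefaultHasher` fingerprint here are deliberately simplistic assumptions, not the project’s implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pull numeric tokens (digits plus '.', ',', '-') out of a text span.
fn extract_numbers(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_ascii_digit() || c == '.' || c == ',' || c == '-'))
        .filter(|t| t.chars().any(|c| c.is_ascii_digit()))
        .map(|t| t.trim_matches(|c: char| c == '.' || c == ',' || c == '-').to_string())
        .filter(|t| !t.is_empty())
        .collect()
}

/// Fingerprint the ordered list of numbers found in a span.
fn number_fingerprint(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    extract_numbers(text).hash(&mut h);
    h.finish()
}

fn main() {
    let source = "Q3 revenue was 1,284,500 USD, up 12.4% YoY.";
    let before = number_fingerprint(source);

    // ... chunking / cleaning / templating would happen here ...
    let processed = "Q3 revenue was 1,284,500 USD, up 12.4% YoY.";

    let after = number_fingerprint(processed);
    assert_eq!(before, after, "numeric content changed during processing");
    println!("numbers preserved: {:?}", extract_numbers(processed));
}
```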
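Finally, for the export side, this is roughly what the OpenAI chat fine-tuning JSONL target looks like (one chat-format JSON object per line), built with serde_json. The helper name and record contents are examples, not doc2dataset’s exporter.

```rust
// Cargo.toml: serde_json = "1"
use serde_json::json;

/// Turn a (question, answer) pair extracted from a document into one
/// OpenAI chat fine-tuning JSONL record.
fn to_openai_jsonl(system: &str, user: &str, assistant: &str) -> String {
    json!({
        "messages": [
            { "role": "system",    "content": system },
            { "role": "user",      "content": user },
            { "role": "assistant", "content": assistant }
        ]
    })
    .to_string()
}

fn main() {
    let line = to_openai_jsonl(
        "You answer questions about the 2024 annual report.",
        "What was Q3 revenue?",
        "Q3 revenue was 1,284,500 USD.",
    );
    println!("{line}"); // one JSON object per line = JSONL
}
```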
Repo (Apache-2.0): GitHub - 3DCF-Labs/doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.
Feedback and contributions appreciated!