3DCF/doc2dataset – Rust-based document → dataset pipeline (30+ formats, token compression, numeric integrity)

⚓ Rust    📅 2025-12-07    👤 surdeus


Hey all 👋

I’ve been working on 3DCF/doc2dataset, a Rust-based pipeline that turns real-world documents (PDF/HTML/JSON/CSV/LaTeX, etc.) into LLM-ready datasets for RAG and fine-tuning.

The original pain: every time I wanted to build a RAG system or fine-tune a model, I had a bunch of documents but no reproducible, efficient way to turn them into training data. Lots of ad-hoc Python scripts, bloated contexts, and numbers getting silently corrupted in financial/legal docs.

So I built the core in Rust and wrapped it with a CLI and language bindings.


What it does (high level)

  • Ingests 30+ formats
    PDF, Markdown, plain text, HTML, XML, JSON, YAML, TOML, CSV/TSV, LaTeX, BibTeX, images via OCR, etc.
  • Token-efficient “macro-cells”
    Instead of dumping raw text, it builds layout-aware chunks with importance scores, typically giving 5–6× token compression while preserving structure (rough sketch after the list).
  • NumGuard: numeric integrity
    Extracts every number, computes a hash, and tracks it through the pipeline so you can detect numeric corruption in financial/legal documents (concept sketch below).
  • Multi-framework export
    Process once, then export to HuggingFace, LLaMA-Factory, Axolotl, OpenAI fine-tuning JSONL, and RAG triples (example record below).
  • Bindings & CLI
    Rust core with a CLI plus Python / Node.js bindings for teams that don’t use Rust directly.
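
To make the macro-cell idea a bit more concrete, this is roughly the shape of chunk I’m talking about. The struct and field names here are just an illustration for this post, not the crate’s actual types:

```rust
// Illustrative only: these names are made up for this post, not the crate's real types.
// A "macro-cell" is roughly a layout-aware chunk plus the metadata needed to rank it.
struct MacroCell {
    /// Where the chunk came from, e.g. ["3 Results", "3.2 Revenue"].
    section_path: Vec<String>,
    /// What kind of layout element it was.
    kind: CellKind,
    /// Compressed, structure-preserving text.
    text: String,
    /// Importance score used to decide what survives token compression.
    importance: f32,
}

#[derive(Debug)]
#[allow(dead_code)]
enum CellKind {
    Paragraph,
    Table,
    List,
    Code,
}

fn main() {
    let cell = MacroCell {
        section_path: vec!["3 Results".into(), "3.2 Revenue".into()],
        kind: CellKind::Table,
        text: "Q3 net revenue | 1,204.7 | +4.2% YoY".into(),
        importance: 0.92,
    };
    println!("{:?} cell, importance {}", cell.kind, cell.importance);
    println!("under {}", cell.section_path.join(" > "));
    println!("{}", cell.text);
}
```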
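
NumGuard, stripped down to a sketch (std-only, and much simpler than the real thing), is basically: extract the numbers, fingerprint them, and compare fingerprints before and after processing:

```rust
// Simplified sketch of the NumGuard idea: pull every number out of the text,
// fingerprint the ordered list, and re-check the fingerprint after the pipeline runs.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Extract numeric tokens, normalizing thousands separators.
fn extract_numbers(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_ascii_digit() || c == '.' || c == ','))
        .filter(|tok| tok.chars().any(|c| c.is_ascii_digit()))
        .map(|tok| tok.replace(',', ""))
        .map(|tok| tok.trim_matches('.').to_string())
        .collect()
}

/// Fingerprint of all numbers in a document, in order.
fn numeric_fingerprint(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    extract_numbers(text).hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let source    = "Net revenue was 1,204.7 in Q3, up 4.2% year over year.";
    let processed = "Net revenue was 1204.7 in Q3, up 4.2% year over year.";
    let corrupted = "Net revenue was 1204.1 in Q3, up 4.2% year over year.";

    // Reformatting that keeps the numbers intact passes the check...
    assert_eq!(numeric_fingerprint(source), numeric_fingerprint(processed));
    // ...while a silently mangled digit is caught.
    assert_ne!(numeric_fingerprint(source), numeric_fingerprint(corrupted));
    println!("numeric integrity check passed");
}
```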
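
On the export side, the OpenAI fine-tuning target is plain JSON Lines where each line carries a chat-style `messages` array. A made-up record built with `serde_json`, just to show the shape (not the crate’s actual export code):

```rust
// Shape of one OpenAI fine-tuning JSONL record (chat format). The content is made up;
// this just shows the target format. Needs the `serde_json` crate.
use serde_json::json;

fn main() {
    let record = json!({
        "messages": [
            { "role": "system",    "content": "You answer questions about the 2024 annual report." },
            { "role": "user",      "content": "What was net revenue in Q3?" },
            { "role": "assistant", "content": "Net revenue in Q3 was 1,204.7 million, up 4.2% year over year." }
        ]
    });

    // One record per line of the .jsonl file.
    println!("{record}");
}
```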

Repo (Apache-2.0): https://github.com/3DCF-Labs/doc2dataset (token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning).

Feedback and contributions appreciated!
