I built a 24K-line database engine from scratch in Rust — here is my crate stack, what I learned, and where I got stuck
⚓ Rust 📅 2026-06-05 👤 surdeus 👁️ 3
Hey everyone, for the past few months, I’ve been deep in the weeds building a distributed transactional SQL and KV database engine called OmniKV completely from scratch in Rust. I wanted to build the entire vertical stack myself to truly understand data layout and reliability, so this isn’t just a thin wrapper around an existing storage layer like RocksDB or SQLite. Every single component, from the LSM-tree storage, positional heap writes, and WAL up to the hand-rolled SQL parser, cost-based optimizer, Volcano executor, SSI transactions, and Raft consensus, is written from the ground up. The project is entirely open-source, and you can check out the implementation on my GitHub here: GitHub - SBALAVIGNESH123/OmniKV: Distributed transactional SQL + KV engine. PostgreSQL wire protocol. Raft consensus. Cost-based optimizer. 290 tests. 20K lines of Rust. From scratch. · GitHub. It’s sitting at around 24,000 lines of code right now with 323 integration tests passing. At 8 threads, I’m clocking about 3.69M reads/sec. The Rust community and various internals blogs have been an incredible resource while building this, so I wanted to share the exact crate decisions I made, what worked beautifully, and where I’m still hitting a wall.
I tried to keep external dependencies minimal, leaning only on foundational crates where writing it myself would just be reinventing a highly optimized wheel. For the memtable, I used crossbeam-skiplist to implement a 16-sharded, FNV-hashed lock-free SkipMap, which ensures that readers literally never block writers and gave me the biggest performance boost. For topology transitions, I used arc_swap so that when background compaction flushes new SSTables, I can publish the new metadata structure atomically while active readers just keep safely holding onto their existing Arc pointer, ensuring zero read stalls during heavy disk I/O. I went with lz4_flex for pure-Rust compression without C FFI headaches, and combined it with crc32fast to run hardware-accelerated integrity checks on every heap entry and WAL record to prevent any silent data corruption. Finally, moka handles my concurrent LRU block cache, memmap2 maps the SSTables so the OS page cache does the heavy lifting, and openraft manages the consensus protocol while I hand-rolled the PostgreSQL Wire Protocol v3 using raw tokio byte-buffers to fully understand the wire layout.
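As a rough illustration of the shard routing described above (not OmniKV's actual code), here is a minimal sketch of a 16-way memtable that picks a shard with 64-bit FNV-1a. The names `ShardedMemtable`, `shard_index`, and `fnv1a` are hypothetical, and a std `RwLock<BTreeMap>` stands in for the `crossbeam-skiplist` `SkipMap` so the sketch has no dependencies. Unlike the real thing, this stand-in is not lock-free.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

const SHARDS: usize = 16;

/// 64-bit FNV-1a hash over the key bytes.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical sharded memtable. Each shard is an ordered map behind a lock;
/// a lock-free skip list would replace `RwLock<BTreeMap>` in a real engine.
struct ShardedMemtable {
    shards: Vec<RwLock<BTreeMap<Vec<u8>, Vec<u8>>>>,
}

impl ShardedMemtable {
    fn new() -> Self {
        let shards = (0..SHARDS).map(|_| RwLock::new(BTreeMap::new())).collect();
        Self { shards }
    }

    /// Maps a key to one of the 16 shards.
    fn shard_index(key: &[u8]) -> usize {
        (fnv1a(key) % SHARDS as u64) as usize
    }

    fn put(&self, key: &[u8], value: &[u8]) {
        let mut shard = self.shards[Self::shard_index(key)].write().unwrap();
        shard.insert(key.to_vec(), value.to_vec());
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let shard = self.shards[Self::shard_index(key)].read().unwrap();
        shard.get(key).cloned()
    }
}
```

Because keys are spread by hash, each shard is only ordered internally, so a range scan over the whole memtable would need to merge all 16 shards.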
Building this really showed me what people mean by fearless concurrency. Early on, my compaction routine and memtable garbage collection were racing on the exact same shard, and the borrow checker flat-out refused to compile it. At the time, I thought it was being overly pedantic, but after mapping it out on paper, I realized it was entirely right: the interleaving would have caused silent data loss under heavy concurrent load. It took two hours to refactor, but it saved me weeks of chasing down an impossible Heisenbug. I also loved the combination of ArcSwap and atomic reference counting. Managing concurrent pointer swaps on index metadata in a garbage-collected language like Go or Java can introduce subtle GC spikes or require careful volatile synchronization, whereas in Rust the old metadata is freed deterministically as soon as the last reader drops its Arc. Additionally, building the entire SQL abstract syntax tree and execution filters out of nested Rust enums made the code incredibly resilient; when I was implementing UNION and INTERSECT, adding a new variant to my enum caused the compiler to immediately flag every single place in my optimizer and executor where I hadn't explicitly handled it yet, catching three catastrophic logic bugs before the code even ran.
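The exhaustiveness check mentioned above is easy to show in miniature. This is a sketch with a hypothetical `SetOp` enum and `apply_set_op` function, not the project's AST. The point is that the `match` has no wildcard arm, so adding a variant stops compilation until every such `match` handles it.

```rust
/// Hypothetical set-operation node in a query plan.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SetOp {
    Union,
    Intersect,
}

/// Applies a set operation to two inputs that are treated as sets.
/// There is deliberately no `_ =>` arm: adding a variant such as `Except`
/// makes this function fail to compile (error E0004, non-exhaustive patterns)
/// until the new case is written.
fn apply_set_op(op: SetOp, left: &[i64], right: &[i64]) -> Vec<i64> {
    match op {
        SetOp::Union => {
            let mut out: Vec<i64> = left.iter().chain(right.iter()).copied().collect();
            out.sort_unstable();
            out.dedup();
            out
        }
        SetOp::Intersect => left
            .iter()
            .copied()
            .filter(|v| right.contains(v))
            .collect(),
    }
}
```

The guarantee only holds where wildcard arms are avoided; a single `_ => unreachable!()` in the optimizer would turn the compile error back into a runtime bug.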
The SQL engine itself features a true statistics-driven cost-based optimizer and a pull-based Volcano iterator pipeline supporting complex predicates, multi-table joins, aggregates, HAVING clauses, window functions like ROW_NUMBER, and EXPLAIN ANALYZE execution tracking. To select the cheapest execution plan, the optimizer runs a statistics gathering phase that scans table data to build per-column histograms tracking distinct values and null fractions. It evaluates access methods using specific cost constants, setting sequential scans at a cost of 1.0 per row, index scans at 0.25 per row, and primary key lookups at a flat 1.0. Hash joins are costed as build rows multiplied by 2.0 plus probe rows multiplied by 0.1, and the optimizer automatically pushes single-table predicates down through the join and places the smaller table on the hash-build side to save memory. Selectivity estimation uses simple formulas: an equality predicate is estimated at 1 divided by the number of distinct values, range filters default to a selectivity of 1/3, conjunctions (AND) combine multiplicatively, and disjunctions (OR) combine via inclusion-exclusion. The chosen plan is compiled into a Volcano execution chain where every operator implements a single trait featuring a next_row function, allowing memory-efficient streaming where filters and projections use constant memory while joins only buffer the smaller build side.
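To make the arithmetic above concrete, here is a small sketch of the cost and selectivity formulas using the constants quoted in the post. The type and function names (`Predicate`, `Access`, `selectivity`, `access_cost`, `hash_join_cost`) are hypothetical, and the AND/OR rules assume the two predicates are statistically independent, which is what the multiplicative rule implies.

```rust
// Cost constants as quoted in the post.
const SEQ_SCAN_COST_PER_ROW: f64 = 1.0;
const INDEX_SCAN_COST_PER_ROW: f64 = 0.25;
const PK_LOOKUP_COST: f64 = 1.0;
const HASH_BUILD_COST_PER_ROW: f64 = 2.0;
const HASH_PROBE_COST_PER_ROW: f64 = 0.1;

/// Hypothetical predicate tree used only for selectivity estimation.
enum Predicate {
    Eq { distinct_values: u64 },
    Range,
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
}

/// Estimated fraction of rows that satisfy the predicate.
fn selectivity(p: &Predicate) -> f64 {
    match p {
        // 1 / NDV; `max(1)` guards against an empty column.
        Predicate::Eq { distinct_values } => 1.0 / (*distinct_values).max(1) as f64,
        Predicate::Range => 1.0 / 3.0,
        Predicate::And(a, b) => selectivity(a) * selectivity(b),
        Predicate::Or(a, b) => {
            let (sa, sb) = (selectivity(a), selectivity(b));
            sa + sb - sa * sb // inclusion-exclusion
        }
    }
}

enum Access {
    SeqScan,
    IndexScan,
    PkLookup,
}

/// Cost of reading `rows` rows through the given access method.
fn access_cost(access: Access, rows: f64) -> f64 {
    match access {
        Access::SeqScan => rows * SEQ_SCAN_COST_PER_ROW,
        Access::IndexScan => rows * INDEX_SCAN_COST_PER_ROW,
        Access::PkLookup => PK_LOOKUP_COST,
    }
}

/// Hash join cost with the smaller input on the build side.
fn hash_join_cost(left_rows: f64, right_rows: f64) -> f64 {
    let build = left_rows.min(right_rows);
    let probe = left_rows.max(right_rows);
    build * HASH_BUILD_COST_PER_ROW + probe * HASH_PROBE_COST_PER_ROW
}
```

For example, joining a 100-row table with a 1,000-row table builds on the 100-row side, giving 100 × 2.0 + 1,000 × 0.1 = 300 cost units.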
The engine also implements a transaction layer with Serializable Snapshot Isolation (SSI), which provides serializable isolation, the strongest level defined by the SQL standard. It detects write-write conflicts by checking if a key was modified after a transaction's snapshot sequence number, tracks read-write anti-dependencies, and aborts a transaction when those dependencies form a dangerous structure that could lead to a cycle. I also built fully functional transactional savepoints, allowing the execution engine to capture snapshots of active write-sets and read-sets so that a partial rollback to a specific savepoint can restore the exact prior state without aborting the entire transaction. On the security front, the architecture is hardened with AES-256-GCM encryption at rest, Argon2id key derivation, constant-time API key comparisons, and JWT authentication for role-based access. Production-grade operational features are also baked in, including automatic MVCC garbage collection triggered during background compactions, query timeouts, a slow query log, and live cluster snapshot replication via Raft.
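A minimal sketch of two of the mechanisms described above: the write-write conflict check against the snapshot sequence number, and savepoints implemented as copies of the write-set. All names here (`Store`, `Txn`, `TxnError`) are hypothetical. The sketch leaves out read-sets and read-write anti-dependency tracking entirely, so on its own it gives snapshot isolation with first-committer-wins, not full SSI.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Debug, PartialEq)]
enum TxnError {
    WriteConflict(String),
    UnknownSavepoint(String),
}

/// Committed state: key -> (commit sequence number, value).
#[derive(Default)]
struct Store {
    committed: HashMap<String, (u64, String)>,
    last_seq: u64,
}

struct Txn {
    snapshot_seq: u64,
    writes: BTreeMap<String, String>,
    /// Each savepoint stores a copy of the write-set at the time it was taken.
    savepoints: Vec<(String, BTreeMap<String, String>)>,
}

impl Store {
    fn begin(&self) -> Txn {
        Txn { snapshot_seq: self.last_seq, writes: BTreeMap::new(), savepoints: Vec::new() }
    }

    /// Aborts if any written key was committed after this transaction's snapshot.
    fn commit(&mut self, txn: Txn) -> Result<u64, TxnError> {
        for key in txn.writes.keys() {
            if let Some((seq, _)) = self.committed.get(key) {
                if *seq > txn.snapshot_seq {
                    return Err(TxnError::WriteConflict(key.clone()));
                }
            }
        }
        self.last_seq += 1;
        for (key, value) in txn.writes {
            self.committed.insert(key, (self.last_seq, value));
        }
        Ok(self.last_seq)
    }
}

impl Txn {
    fn put(&mut self, key: &str, value: &str) {
        self.writes.insert(key.to_string(), value.to_string());
    }

    fn savepoint(&mut self, name: &str) {
        self.savepoints.push((name.to_string(), self.writes.clone()));
    }

    /// Restores the write-set captured at `name` and discards later savepoints.
    /// The named savepoint itself stays usable.
    fn rollback_to(&mut self, name: &str) -> Result<(), TxnError> {
        let idx = self
            .savepoints
            .iter()
            .rposition(|(n, _)| n.as_str() == name)
            .ok_or_else(|| TxnError::UnknownSavepoint(name.to_string()))?;
        self.writes = self.savepoints[idx].1.clone();
        self.savepoints.truncate(idx + 1);
        Ok(())
    }
}
```

Cloning the whole write-set per savepoint is the simplest approach; an undo log of changes since the savepoint would be cheaper for large transactions.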
To be completely honest, there are still plenty of rough edges and technical debt I'm sorting through. The database runs entirely on safe Rust with zero unsafe blocks, and while I’m definitely leaving some performance wins on the table by not using raw pointers for custom allocators or unsafe mmap views, getting correctness right was my absolute highest priority. My biggest design hurdle right now is that my SeqScanIter materializes all matched rows into a heap-allocated Vec before streaming them up the Volcano pipeline; I really want to turn this into a true streaming iterator directly from the underlying storage engine, but I keep running into brutal lifetime and borrowing errors when trying to pass a borrowed iterator up through a trait object. Additionally, my SQL parser is a massive, 1,060-line wall of pure hand-rolled recursive descent logic because I didn't want to use parsing libraries, and while the storage engine is entirely synchronous to keep blocking fsync syscalls out of my core code, wrapping those loops inside tokio worker threads introduces some subtle architectural friction. Furthermore, the distributed two-phase commit protocol exists but has not been validated under coordinator crash scenarios, and the query parser and wire protocols have not yet been fuzz-tested.
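One possible way around the borrowed-iterator problem described above, sketched under assumptions: if the scan operator owns a shared handle to an immutable snapshot (an `Arc`) plus a cursor, and `next_row` returns owned rows, then nothing borrowed crosses the trait object and no lifetime parameter is needed. The `Operator`, `SeqScan`, and `Filter` types below are hypothetical, and `Arc<Vec<Row>>` stands in for whatever snapshot handle the storage engine can provide.

```rust
use std::sync::Arc;

type Row = Vec<i64>;

/// Pull-based operator: each call yields at most one owned row.
trait Operator {
    fn next_row(&mut self) -> Option<Row>;
}

/// Sequential scan that owns a shared handle to the data instead of borrowing it,
/// so `Box<dyn Operator>` carries no lifetime tied to the storage engine.
struct SeqScan {
    table: Arc<Vec<Row>>, // stand-in for an immutable storage snapshot
    cursor: usize,
}

impl Operator for SeqScan {
    fn next_row(&mut self) -> Option<Row> {
        // Only the current row is copied; nothing is materialized up front.
        let row = self.table.get(self.cursor)?.clone();
        self.cursor += 1;
        Some(row)
    }
}

/// Filter that pulls from its child until a row passes the predicate.
struct Filter {
    input: Box<dyn Operator>,
    predicate: Box<dyn Fn(&Row) -> bool>,
}

impl Operator for Filter {
    fn next_row(&mut self) -> Option<Row> {
        while let Some(row) = self.input.next_row() {
            if (self.predicate)(&row) {
                return Some(row);
            }
        }
        None
    }
}

/// Drains an operator tree; used here only to show the pipeline end to end.
fn collect_all(mut op: Box<dyn Operator>) -> Vec<Row> {
    let mut out = Vec::new();
    while let Some(row) = op.next_row() {
        out.push(row);
    }
    out
}
```

The trade-off is one row copy per `next_row` call in exchange for constant memory and a plain object-safe trait. Whether that copy matters depends on row width and on how cheaply the storage layer can hand out such a snapshot handle.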
Currently, on standard consumer hardware, the engine is hitting 784k ops/sec on sequential reads, 696k ops/sec on random reads, and scales up to 3.69M ops/sec with 8 threads. On the write side, my custom Group Commit v2 design groups concurrent writes to batch fsync calls, giving me an 11.3x scaling boost at 8 threads. To test durability, I wrote a chaos harness that loops through 1,000 ungraceful crash-recovery cycles right in the middle of compactions and appends garbage bytes to the WAL, and the engine successfully flags the corruption or rolls back to the last valid transaction state with zero panics or data loss.
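The general idea behind group commit can be sketched in a few lines. This is not the Group Commit v2 design from the post, whose details are not described here, just the basic pattern: a single committer blocks for one request, drains whatever else is already queued, and pays for one sync for the whole batch. `WriteReq`, `FakeWal`, and `commit_loop` are hypothetical names, and `FakeWal` is an in-memory stand-in that counts simulated fsync calls instead of touching a file.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// A write request plus a channel used to tell the writer it is durable.
struct WriteReq {
    payload: Vec<u8>,
    ack: Sender<u64>, // receives the number of the sync that covered this write
}

/// In-memory stand-in for a WAL file.
#[derive(Default)]
struct FakeWal {
    bytes: Vec<u8>,
    sync_count: u64,
}

impl FakeWal {
    fn append(&mut self, payload: &[u8]) {
        self.bytes.extend_from_slice(payload);
    }

    /// A real WAL would call something like `File::sync_data` here.
    fn sync(&mut self) {
        self.sync_count += 1;
    }
}

/// Runs until every request sender has been dropped.
fn commit_loop(rx: Receiver<WriteReq>, wal: &mut FakeWal) {
    // Block until at least one request arrives.
    while let Ok(first) = rx.recv() {
        let mut batch = vec![first];
        // Drain everything that queued up in the meantime without blocking.
        while let Ok(req) = rx.try_recv() {
            batch.push(req);
        }
        for req in &batch {
            wal.append(&req.payload);
        }
        wal.sync(); // one sync covers the whole batch
        for req in batch {
            // Ignore send errors: the writer may have stopped waiting.
            let _ = req.ack.send(wal.sync_count);
        }
    }
}
```

The batching is self-tuning: while one sync is in flight, new requests pile up in the channel, so heavier write concurrency naturally produces larger batches.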
Moving forward, I would genuinely love to get the community's advice on a few things. If you've built a pull-based iterator pipeline using the Volcano model in Rust, how did you handle streaming rows directly from disk without allocating a full vector buffer up front, and do I need to completely migrate to Generic Associated Types (GATs) for this? I'd also love to know if arc_swap is the industry standard for hot-swapping LSM tree topology metadata, or if there is a lighter-weight pattern I should look into, along with any other pure-Rust crate recommendations for high-performance disk manipulation that I might have overlooked. Again, the code is all open-source at GitHub - SBALAVIGNESH123/OmniKV: Distributed transactional SQL + KV engine. PostgreSQL wire protocol. Raft consensus. Cost-based optimizer. 290 tests. 20K lines of Rust. From scratch. · GitHub, and I'd love to hear your thoughts, structural critiques, or answer any deep-dive questions about the internals!
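For readers unfamiliar with the GAT option raised above, the usual shape is a "lending iterator", where each item may borrow from the iterator itself. The sketch below uses hypothetical names (`LendingIterator`, `RowCursor`, `count_rows`) and fixed-width rows in a byte buffer as a stand-in for a storage block. One caveat that bears on the trait-object question: as far as I know, a trait with a generic associated type like this cannot currently be used as `dyn LendingIterator`, so it suits generic (monomorphized) pipelines, not boxed operators.

```rust
/// Iterator whose items may borrow from the iterator itself (requires GATs).
trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;

    fn next<'a>(&'a mut self) -> Option<Self::Item<'a>>;
}

/// Cursor over fixed-width rows that reuses one scratch buffer for every row.
struct RowCursor {
    data: Vec<u8>, // stand-in for a block read from disk
    pos: usize,
    row_len: usize,
    buf: Vec<u8>,
}

impl RowCursor {
    fn new(data: Vec<u8>, row_len: usize) -> Self {
        Self { data, pos: 0, row_len, buf: Vec::with_capacity(row_len) }
    }
}

impl LendingIterator for RowCursor {
    type Item<'a> = &'a [u8] where Self: 'a;

    fn next<'a>(&'a mut self) -> Option<&'a [u8]> {
        // A trailing partial row (or a zero row length) ends iteration.
        if self.row_len == 0 || self.pos + self.row_len > self.data.len() {
            return None;
        }
        self.buf.clear();
        self.buf.extend_from_slice(&self.data[self.pos..self.pos + self.row_len]);
        self.pos += self.row_len;
        // The returned slice borrows `self.buf`, so the caller must finish
        // with it before asking for the next row.
        Some(&self.buf)
    }
}

/// Generic consumer: works with any lending iterator, no allocation per row.
fn count_rows<I: LendingIterator>(it: &mut I) -> usize {
    let mut n = 0;
    while let Some(_row) = it.next() {
        n += 1;
    }
    n
}
```

Because each row borrows the cursor, rows cannot be collected into a `Vec` without copying, which is exactly the property that makes this shape allocation-free for streaming operators such as filters.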