How omq.rs went from 80k to 9M msg/s (optimization journal)

Category: Rust | Date: 2026-05-20 | Author: surdeus

Followup to the omq.rs announcement. I documented the full optimization path from "naive tokio actor" to "faster than libzmq at most message sizes" in "How to beat libzmq". This post is the condensed version.

Disclaimer: Written with heavy assist from Claude.

Why this is hard

Large messages are easy. Using writev to send header and payload in one syscall, where libzmq issues separate send() calls, gives a 2-3x throughput advantage above 2 KiB right away.
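
For illustration, here is a minimal sketch of the header + payload idea using only the standard library. `send_frame` is a hypothetical helper, not omq's API, and it assumes a ZMTP-style long frame (one flags byte with the LONG bit set, then an 8-byte big-endian length). On a real socket, `write_vectored` maps to writev.

```rust
use std::io::{self, IoSlice, Write};

/// Hypothetical helper: sends a long-frame header and its payload
/// with a single vectored write instead of two separate writes.
fn send_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    let mut header = [0u8; 9];
    header[0] = 0x02; // LONG flag: an 8-byte length follows
    header[1..].copy_from_slice(&(payload.len() as u64).to_be_bytes());

    // Both pieces go to the writer in one call.
    let bufs = [IoSlice::new(&header), IoSlice::new(payload)];
    let mut written = w.write_vectored(&bufs)?;

    // A vectored write may be short, so finish any remainder with plain writes.
    if written < header.len() {
        w.write_all(&header[written..])?;
        written = header.len();
    }
    w.write_all(&payload[written - header.len()..])
}
```

The payload is never copied into a staging buffer; the kernel gathers both slices itself, which is why this pays off most for large messages.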

Small messages (8-128 B) are the challenge. Encoding is cheap; kernel round-trips dominate. libzmq runs a dedicated I/O thread that overlaps encoding with kernel writes. A single-threaded library that encodes and writes sequentially cannot keep up unless its code path is shorter everywhere else.

The starting point

Straightforward async architecture: one actor per socket, one driver task per connection, messages routed through channels between them. Three context switches per send.

128 B TCP: 80k msg/s. libzmq: 3M. Nearly 40x behind.

The 48x jump

Most socket types (PUSH, PUB, DEALER, ...) don't modify any shared state on send. The message just needs to reach the connection. Routing it through an actor that does nothing with it costs three context switches for no reason.

Bypassing the actor and letting the sender push directly into the connection's write queue took throughput from 80k to 4M msg/s. Same idea on recv: the connection pushes into the user-facing channel directly instead of routing through the actor first.
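
A rough sketch of what "push directly into the connection's write queue" can look like. This uses `std::sync::mpsc` as a stand-in for whatever queue the real backends use, and `PushSocket` with its round-robin policy is a hypothetical illustration, not omq's actual type.

```rust
use std::sync::mpsc::{channel, Receiver, SendError, Sender};

/// Hypothetical PUSH-style socket that holds each connection's
/// write queue itself, so a send is a single hop.
struct PushSocket {
    write_queues: Vec<Sender<Vec<u8>>>,
    next: usize,
}

impl PushSocket {
    /// Registers a connection and returns the receiving end that
    /// the connection's driver task would drain.
    fn connect(&mut self) -> Receiver<Vec<u8>> {
        let (tx, rx) = channel();
        self.write_queues.push(tx);
        rx
    }

    /// Round-robins straight into a connection's queue; no socket
    /// actor sits between the caller and the connection.
    fn send(&mut self, msg: Vec<u8>) -> Result<(), SendError<Vec<u8>>> {
        assert!(!self.write_queues.is_empty(), "sketch assumes a connection exists");
        let i = self.next % self.write_queues.len();
        self.next = self.next.wrapping_add(1);
        self.write_queues[i].send(msg)
    }
}
```

This only works because sending on these socket types touches no shared state beyond picking the next connection.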

Switching to io_uring barely moved the needle until this change. io_uring's speed only shows up when the hot path is short enough to expose it.

Batching small messages into one syscall

At 128 B, the sender was handing the kernel 1000+ tiny scatter-gather entries (2 per message: header + payload). The kernel limit is 1024 entries per call (IOV_MAX on Linux), so a single writev could carry at most 512 messages.

Fix: small messages go contiguously into a shared buffer. N messages become one entry for the whole batch.
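
A minimal sketch of the contiguous encoding, assuming a simplified short frame of one flags byte plus a one-byte length. `encode_batch` is a hypothetical name, not omq's function.

```rust
/// Appends every message as [flags, len, payload...] to one shared
/// buffer, so the whole batch is a single scatter-gather entry.
fn encode_batch(msgs: &[&[u8]], buf: &mut Vec<u8>) {
    for msg in msgs {
        assert!(msg.len() <= 255, "sketch only handles short frames");
        buf.push(0); // flags: no MORE, no LONG
        buf.push(msg.len() as u8);
        buf.extend_from_slice(msg);
    }
}
```

The copy into `buf` costs a memcpy per message, but for small payloads that is cheaper than giving the kernel two iovec entries each.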

128 B TCP went from 1.5M to 3.0M. Past libzmq's 2.95M for the first time.

Closing the 8 B gap: nine rounds

At this point, omq beat libzmq from 128 B up but trailed at 8 B: 3.8M vs 8.4M msg/s (0.45x). Profiling showed 38% of time in codec parsing, 16% in reference counting, and 18% in async overhead. It took nine rounds to fix; the main changes:

Avoid unnecessary allocations. An 8-byte message doesn't need heap storage. Inline buffers (38 B capacity) eliminated Arc ref-counting entirely for small payloads. Per-frame atomic ops went from ~3 to zero.
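
As a sketch of the inline-buffer idea: the 38 B capacity comes from the text above, but the `Payload` type and its layout are my own illustration, not omq's actual representation.

```rust
use std::sync::Arc;

const INLINE_CAP: usize = 38; // capacity quoted in the post

/// Hypothetical payload type: small bodies live inline,
/// larger ones fall back to a shared heap allocation.
enum Payload {
    Inline { len: u8, data: [u8; INLINE_CAP] },
    Shared(Arc<[u8]>),
}

impl Payload {
    fn from_slice(bytes: &[u8]) -> Self {
        if bytes.len() <= INLINE_CAP {
            // Small case: plain copy, no allocation, no atomics.
            let mut data = [0u8; INLINE_CAP];
            data[..bytes.len()].copy_from_slice(bytes);
            Payload::Inline { len: bytes.len() as u8, data }
        } else {
            Payload::Shared(Arc::from(bytes))
        }
    }

    fn as_slice(&self) -> &[u8] {
        match self {
            Payload::Inline { len, data } => &data[..*len as usize],
            Payload::Shared(arc) => &arc[..],
        }
    }
}
```

Moving or dropping the `Inline` variant never touches a reference count, which is where the per-frame atomic ops disappear.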

Eliminate async overhead on the fast path. The receive cache (where decoded messages wait for the application) was behind a Mutex, costing one CAS per pop even though the runtime is single-threaded. Replaced with an UnsafeCell wrapper. Separated the data path from the control path in the codec, so that swapping in a batch of ~800 messages at once makes every subsequent pop lock-free.
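
A minimal sketch of such a wrapper, assuming it is only ever used from one thread. `RecvCache` is a hypothetical name. `UnsafeCell` is `!Sync`, so the compiler already rejects sharing it across threads.

```rust
use std::cell::UnsafeCell;
use std::collections::VecDeque;

/// Hypothetical single-threaded receive cache with no lock and no atomics.
struct RecvCache<T> {
    queue: UnsafeCell<VecDeque<T>>,
}

impl<T> RecvCache<T> {
    fn new() -> Self {
        Self { queue: UnsafeCell::new(VecDeque::new()) }
    }

    /// Swaps a freshly decoded batch in; the previous (normally
    /// empty) queue is left in `batch` for reuse.
    fn swap_batch(&self, batch: &mut VecDeque<T>) {
        // SAFETY: the type is !Sync and no reference into the queue
        // escapes any method, so this is the only live access.
        unsafe { std::mem::swap(&mut *self.queue.get(), batch) }
    }

    /// Pops one message without any synchronization.
    fn pop(&self) -> Option<T> {
        // SAFETY: same argument as in `swap_batch`.
        unsafe { (*self.queue.get()).pop_front() }
    }
}
```

A `RefCell` would be the safe alternative; it trades the `unsafe` for a borrow-flag check on every access.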

Cross-crate inlining. The codec lives in a separate crate from the I/O backend. Without #[inline] annotations, the compiler couldn't inline hot functions across the boundary. One function alone (split_to) was 11.9% self time. After annotation, no LTO needed.
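
To show what the annotation looks like, here is a small sketch. `FrameCursor` is a hypothetical stand-in for the codec crate's buffer type; only the name `split_to` is taken from the text above.

```rust
/// Hypothetical cursor over a receive buffer, exported by the codec crate.
pub struct FrameCursor<'a> {
    rest: &'a [u8],
}

impl<'a> FrameCursor<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Self { rest: buf }
    }

    /// `#[inline]` makes the body available to downstream crates, so
    /// callers in the I/O backend can inline it without relying on LTO.
    #[inline]
    pub fn split_to(&mut self, n: usize) -> &'a [u8] {
        let rest: &'a [u8] = self.rest;
        let (head, tail) = rest.split_at(n);
        self.rest = tail;
        head
    }
}
```

Generic functions get this treatment automatically because they are instantiated in the calling crate; small non-generic functions are generally the ones that need the hint.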

Fuse the decode path. Previously the codec would parse the header, allocate an intermediate representation, copy into a message struct, and push to a queue. Three copies per message. Fusing that into a single step that reads bytes directly into the final message got it down to one copy.
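
A sketch of a fused decode step, assuming simplified short frames (flags byte, one-byte length, body). `Message` and `decode_one` are hypothetical names. The only copy is from the read buffer into the final message.

```rust
/// Hypothetical final message type handed to the application.
#[derive(Debug, PartialEq)]
struct Message {
    more: bool,
    body: Vec<u8>,
}

/// Parses one short frame straight into a `Message`.
/// Returns the message and the bytes consumed, or `None` if the
/// buffer does not yet hold a complete frame.
fn decode_one(buf: &[u8]) -> Option<(Message, usize)> {
    let (&flags, rest) = buf.split_first()?;
    let (&len, rest) = rest.split_first()?;
    let body = rest.get(..len as usize)?;
    let msg = Message {
        more: (flags & 0x01) != 0, // MORE bit
        body: body.to_vec(),       // the single copy
    };
    Some((msg, 2 + len as usize))
}
```

There is no intermediate frame struct: the header fields are read and consumed in the same function that builds the message.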

After all nine rounds, 8 B IPC went from 3.8M to 8.2M. Current: 8.72M vs libzmq's 8.44M (1.03x).

Dead ends

Bypass the write loop. Normally the sender pushes into a queue and a driver loop wakes up, drains everything, and writes it all in one syscall. Hundreds of messages batched into one write_vectored. I tried skipping the queue and writing directly from send, so each message hits the kernel immediately. Latency improved (165 down to 85 µs RTT), but throughput collapsed 7x (830k down to 115k) because every send became its own syscall. The driver loop's apparent inefficiency (queuing, periodic draining) is actually its most important feature.
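
The syscall difference can be sketched with a writer that counts calls. Everything here is hypothetical scaffolding: `CountingWriter` stands in for a socket, and the sketch ignores short writes and the iovec limit.

```rust
use std::collections::VecDeque;
use std::io::{self, IoSlice, Write};

/// Counts write calls, standing in for syscalls on a real socket.
struct CountingWriter {
    calls: usize,
    bytes: Vec<u8>,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }
    fn write_vectored(&mut self, bufs: &[IoSlice<'_>]) -> io::Result<usize> {
        self.calls += 1;
        let mut n = 0;
        for b in bufs {
            self.bytes.extend_from_slice(b);
            n += b.len();
        }
        Ok(n)
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Direct path: every message becomes its own write call.
fn send_direct<W: Write>(w: &mut W, msgs: &[Vec<u8>]) -> io::Result<()> {
    for m in msgs {
        w.write_all(m)?;
    }
    Ok(())
}

/// Driver-loop path: drain the queue, then issue one vectored write.
fn drain_batched<W: Write>(w: &mut W, queue: &mut VecDeque<Vec<u8>>) -> io::Result<usize> {
    let pending: Vec<Vec<u8>> = queue.drain(..).collect();
    let slices: Vec<IoSlice<'_>> = pending.iter().map(|m| IoSlice::new(m)).collect();
    w.write_vectored(&slices)
}
```

With 100 queued messages the direct path makes 100 calls and the batched path makes one, which is the trade between latency and throughput described above.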

TCP_CORK. Two setsockopt syscalls per flush. 10-15% regression. The coalescing it provides already comes from scatter-gather writes.

Sharing read buffers via Arc instead of copying. I expected Bytes::slice (share the Arc, no copy) to beat inline copy for small payloads. It didn't. For anything that fits in a cache line, the Arc bump + drop (~10 ns for two atomics) costs the same as just copying the bytes. At this scale, atomics are no cheaper than memcpy.

Inproc: when TCP beats your in-process path

After the wire transport work, in-process message passing at 32 B ran at 2.1M msg/s, 25% slower than TCP. TCP, which serializes ZMTP frames and crosses the kernel, was faster because it batches: many small messages coalesce into one buffer, one io_uring submission, two cross-core cache-line transfers for the whole batch. The channel library (flume) was doing per-message atomics and wakeups: two cache-line round-trips per message.

Two fixes:

blume: MPSC channel that notifies only on empty-to-non-empty transitions (N sends, one wakeup) and lets the consumer swap the entire queue into a local buffer in one lock round-trip. That got inproc 32 B from 2.1M to 2.9M (+36%).
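
The two ideas in blume (wake only on the empty-to-non-empty transition, and let the consumer take the whole queue at once) can be sketched with a plain `Mutex`. This is my own simplified illustration, not blume's implementation; `BatchQueue` and its method names are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical multi-producer queue with batch hand-off.
struct BatchQueue<T> {
    inner: Mutex<VecDeque<T>>,
}

impl<T> BatchQueue<T> {
    fn new() -> Self {
        Self { inner: Mutex::new(VecDeque::new()) }
    }

    /// Returns true only when the queue went from empty to
    /// non-empty, i.e. when the consumer actually needs a wakeup.
    fn push(&self, item: T) -> bool {
        let mut q = self.inner.lock().unwrap();
        let was_empty = q.is_empty();
        q.push_back(item);
        was_empty
    }

    /// Consumer side: exchanges the shared queue with an empty
    /// local buffer in one lock round-trip.
    fn swap_out(&self, local: &mut VecDeque<T>) {
        debug_assert!(local.is_empty(), "sketch expects an empty local buffer");
        let mut q = self.inner.lock().unwrap();
        std::mem::swap(&mut *q, local);
    }
}
```

A producer would only notify the consumer when `push` returns true, so N sends into a non-empty queue cost one wakeup in total.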

yring: per-connection SPSC ring for cross-thread transfers. One atomic per batch on each side, not per message. Inproc 8 B went from 3.1M to 16.8M (1.6x libzmq). 128 B: 12.2M (4x libzmq).
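
A rough sketch of batch publishing on an SPSC ring, not yring's actual code. `BatchRing` is a hypothetical type that carries `u64` values, and it stores slots as Relaxed atomics purely to stay in safe Rust. The point is that each side does one Acquire load and one Release store per batch rather than per message. It assumes exactly one producer thread and one consumer thread, which the types do not enforce.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Hypothetical single-producer single-consumer ring with batch publish.
pub struct BatchRing {
    slots: Vec<AtomicU64>,
    head: AtomicUsize, // next index to read; written by the consumer only
    tail: AtomicUsize, // next index to write; written by the producer only
}

impl BatchRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Self {
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer: writes as many items as fit, then publishes them
    /// all with a single Release store. Returns the count written.
    pub fn push_batch(&self, items: &[u64]) -> usize {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        let free = self.slots.len() - tail.wrapping_sub(head);
        let n = items.len().min(free);
        let mask = self.slots.len() - 1;
        for (i, &item) in items[..n].iter().enumerate() {
            self.slots[tail.wrapping_add(i) & mask].store(item, Ordering::Relaxed);
        }
        self.tail.store(tail.wrapping_add(n), Ordering::Release);
        n
    }

    /// Consumer: reads everything published so far, then frees
    /// the slots with a single Release store. Returns the count read.
    pub fn pop_batch(&self, out: &mut Vec<u64>) -> usize {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        let n = tail.wrapping_sub(head);
        let mask = self.slots.len() - 1;
        for i in 0..n {
            out.push(self.slots[head.wrapping_add(i) & mask].load(Ordering::Relaxed));
        }
        self.head.store(head.wrapping_add(n), Ordering::Release);
        n
    }
}
```

A real ring would carry messages rather than integers, which needs `UnsafeCell` slots; the index handling stays the same.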

Where it stands now

128 B TCP PUSH/PULL throughput:

Implementation          msg/s
libzmq 5.2.5            2.95M
omq-compio (io_uring)   5.11M (1.7x)
omq-tokio               5.58M (1.9x)

Full comparison tables: COMPARISONS.md

Design docs: architecture, compio backend, tokio backend

The full journal with profiles and measurements at each step: doc/performance.md
