Sorting is slow because architectures are wrong โ zan-sort redesigns the architecture, not the algorithm
โ Rust ๐ 2026-05-07 ๐ค surdeus ๐๏ธ 13 days ago, I came up with an idea that some concepts used in my previous work could be applied to sorting. The idea was simple: โIf a sorting engine fully exploited CPU-optimized behavior, could it be improved significantly?โ
After experimenting, I found that the answer is yesโbut not by changing the algorithm. The real gains come from redesigning the architecture.
Most sorting implementations are limited not by algorithmic complexity, but by how poorly they align with modern CPU memory hierarchies. Once the ordering rule is reduced to a single numeric key, the dominant cost becomes memory traffic and DRAM bandwidth, not comparisons or O(N log N) behavior.
1. Single-pass disjoint routing (DRAM-bound regime)
Large inputs are processed in a single global routing pass. A prefix-sum phase computes disjoint write regions for each thread, enabling lock-free scatter writes with no atomics, no shared mutable state, and no synchronization. This reduces memory traffic to the theoretical minimum and saturates DRAM bandwidth.
2. Cache-hierarchy alignment
The architecture adapts to L1/L2/L3 boundaries:
- L1-sized inputs fall back to comparison-based logic (cheaper than arithmetic routing).
- L2-sized inputs use a Structure-of-Arrays local bucket design with bitmap-based collision resolution.
- Larger inputs use the full routing engine with dynamic precision scaling to avoid u128 emulation overhead.
3. Minimal sufficient structure
The engine intentionally avoids mechanisms that do not increase throughput:
- No multi-pass histograms
- No coordination between threads
- No atomics or locks
- Only one routing pass and one prefix-sum phase
This minimalism is deliberate: any additional mechanism would not increase
bandwidth utilization, and any reduction would prevent saturation.
4. Hardware-driven scaling
Under identical CPU constraints (8 cores), the architecture outperforms both
parallel comparison sort and parallel radix sort by reducing memory traffic to
a single pass.
Benchmarks (Ryzen 7430U, 100M u32 elements, 8-core limit):
- rayon::par_sort_unstable: 954 ms
- parallel radix sort: 1.33 s
- zan-sort: 678 ms
5. A single-key truth model
The engine abandons Ord entirely and relies on a u64 โabsolute truthโ key. This guarantees strict O(N) routing behavior with no fallback comparisons.
Full implementation and details:
Iโd be interested in feedback on the architectural approach, especially around routing design, cache alignment, and DRAM-bound scaling.
6 posts - 3 participants
๐ท๏ธ Rust_feed