Sorting is slow because architectures are wrong โ€” zan-sort redesigns the architecture, not the algorithm

โš“ Rust    ๐Ÿ“… 2026-05-07    ๐Ÿ‘ค surdeus    ๐Ÿ‘๏ธ 1      

surdeus

3 days ago, I came up with an idea that some concepts used in my previous work could be applied to sorting. The idea was simple: โ€œIf a sorting engine fully exploited CPU-optimized behavior, could it be improved significantly?โ€

After experimenting, I found that the answer is yesโ€”but not by changing the algorithm. The real gains come from redesigning the architecture.

Most sorting implementations are limited not by algorithmic complexity, but by how poorly they align with modern CPU memory hierarchies. Once the ordering rule is reduced to a single numeric key, the dominant cost becomes memory traffic and DRAM bandwidth, not comparisons or O(N log N) behavior.

1. Single-pass disjoint routing (DRAM-bound regime)

Large inputs are processed in a single global routing pass. A prefix-sum phase computes disjoint write regions for each thread, enabling lock-free scatter writes with no atomics, no shared mutable state, and no synchronization. This reduces memory traffic to the theoretical minimum and saturates DRAM bandwidth.

2. Cache-hierarchy alignment

The architecture adapts to L1/L2/L3 boundaries:

  • L1-sized inputs fall back to comparison-based logic (cheaper than arithmetic routing).
  • L2-sized inputs use a Structure-of-Arrays local bucket design with bitmap-based collision resolution.
  • Larger inputs use the full routing engine with dynamic precision scaling to avoid u128 emulation overhead.

3. Minimal sufficient structure

The engine intentionally avoids mechanisms that do not increase throughput:

  • No multi-pass histograms
  • No coordination between threads
  • No atomics or locks
  • Only one routing pass and one prefix-sum phase

This minimalism is deliberate: any additional mechanism would not increase
bandwidth utilization, and any reduction would prevent saturation.

4. Hardware-driven scaling

Under identical CPU constraints (8 cores), the architecture outperforms both
parallel comparison sort and parallel radix sort by reducing memory traffic to
a single pass.

Benchmarks (Ryzen 7430U, 100M u32 elements, 8-core limit):

  • rayon::par_sort_unstable: 954 ms
  • parallel radix sort: 1.33 s
  • zan-sort: 678 ms

5. A single-key truth model

The engine abandons Ord entirely and relies on a u64 โ€œabsolute truthโ€ key. This guarantees strict O(N) routing behavior with no fallback comparisons.


Full implementation and details:


Iโ€™d be interested in feedback on the architectural approach, especially around routing design, cache alignment, and DRAM-bound scaling.

6 posts - 3 participants

Read full topic

๐Ÿท๏ธ Rust_feed