Unexpected parallel sort performance
⚓ Rust 📅 2025-07-10 👤 surdeus 👁️ 18

Hello Rust community!
I would love to hear some thoughts on why a safe parallel merge sort performs better than an unsafe implementation.
I compare three simple parallel merge sorts (all implementations and the benchmark can be found here):
- Slices, which passes two halves of an array for parallel processing using `split_at_mut`
- Slices Unchecked, which does the same thing but uses unchecked operations
- Single Array, which operates on a `*mut i32` and therefore does not need any splitting
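In short, the Slices variant looks roughly like this (a simplified sketch, not the exact benchmarked code — function names and the fixed recursion-depth cutoff are mine; the real implementations are in the linked repo). `split_at_mut` hands out two disjoint `&mut [i32]` views, so the halves can be sorted on separate scoped threads without any `unsafe`:

```rust
use std::thread;

/// Recursive merge sort; while `depth > 0`, one half is sorted on a
/// freshly spawned scoped thread, the other on the current thread.
fn merge_sort(data: &mut [i32], depth: u32) {
    let n = data.len();
    if n <= 1 {
        return;
    }
    let mid = n / 2;
    // split_at_mut produces two non-overlapping mutable slices,
    // which is what makes the parallel calls safe.
    let (left, right) = data.split_at_mut(mid);
    if depth > 0 {
        thread::scope(|s| {
            s.spawn(move || merge_sort(left, depth - 1));
            merge_sort(right, depth - 1);
        });
    } else {
        merge_sort(left, 0);
        merge_sort(right, 0);
    }
    merge(data, mid);
}

/// Merge the sorted halves data[..mid] and data[mid..] via a scratch buffer.
fn merge(data: &mut [i32], mid: usize) {
    let mut out = Vec::with_capacity(data.len());
    let (mut i, mut j) = (0, mid);
    while i < mid && j < data.len() {
        if data[i] <= data[j] {
            out.push(data[i]);
            i += 1;
        } else {
            out.push(data[j]);
            j += 1;
        }
    }
    out.extend_from_slice(&data[i..mid]);
    out.extend_from_slice(&data[j..]);
    data.copy_from_slice(&out);
}

fn main() {
    let mut v = vec![5, 3, 8, 1, 9, 2, 7, 4, 6, 0];
    merge_sort(&mut v, 2);
    assert_eq!(v, (0..10).collect::<Vec<i32>>());
    println!("{:?}", v);
}
```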
What I expect:
In both the parallel and the sequential case, Slices is slower than both Slices Unchecked and Single Array.
What I observed after benchmarking:
- In the sequential case, Slices Unchecked was 2% faster and Single Array was 6.6% faster
- In the parallel case, Slices Unchecked was 4.7% faster, but Single Array was 14.8% SLOWER!
Why does the Single Array method perform worse than Slices when I switch to parallel execution? Single Array doesn't do any index bounds checking, and neither does Slices Unchecked, so I would expect them to perform similarly, as they do in the sequential case. What could be the reasons?
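For concreteness, the Single Array variant looks roughly like this (again a simplified sketch of my own, not the benchmarked code — see the linked repo for the real thing). Both recursive calls receive the same `*mut i32` with different index ranges, so there is no slice splitting at all, and safety rests entirely on the two halves touching disjoint ranges:

```rust
use std::thread;

/// Wrapper to make the raw pointer Send; soundness relies on the two
/// recursive calls operating on disjoint index ranges of the same buffer.
#[derive(Clone, Copy)]
struct Ptr(*mut i32);
unsafe impl Send for Ptr {}

/// Sort the range [lo, hi) of the buffer behind `p`.
unsafe fn merge_sort_raw(p: Ptr, lo: usize, hi: usize, depth: u32) {
    let n = hi - lo;
    if n <= 1 {
        return;
    }
    let mid = lo + n / 2;
    if depth > 0 {
        thread::scope(|s| {
            // Both threads see the same pointer, just different ranges.
            s.spawn(move || unsafe { merge_sort_raw(p, lo, mid, depth - 1) });
            unsafe { merge_sort_raw(p, mid, hi, depth - 1) };
        });
    } else {
        unsafe {
            merge_sort_raw(p, lo, mid, 0);
            merge_sort_raw(p, mid, hi, 0);
        }
    }
    // Merge [lo, mid) and [mid, hi) through a scratch buffer.
    let mut out = Vec::with_capacity(n);
    let (mut i, mut j) = (lo, mid);
    unsafe {
        while i < mid && j < hi {
            let (a, b) = (*p.0.add(i), *p.0.add(j));
            if a <= b {
                out.push(a);
                i += 1;
            } else {
                out.push(b);
                j += 1;
            }
        }
        while i < mid {
            out.push(*p.0.add(i));
            i += 1;
        }
        while j < hi {
            out.push(*p.0.add(j));
            j += 1;
        }
        for (k, v) in out.into_iter().enumerate() {
            *p.0.add(lo + k) = v;
        }
    }
}

fn main() {
    let mut v = vec![4, 2, 9, 1, 3, 8, 0, 7, 6, 5];
    unsafe { merge_sort_raw(Ptr(v.as_mut_ptr()), 0, v.len(), 2) };
    assert_eq!(v, (0..10).collect::<Vec<i32>>());
    println!("{:?}", v);
}
```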
More detailed benchmarking results
Sequential case
| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 512 | 16µs | 15µs | -7% | 16µs | 0% |
| 2048 | 77µs | 70µs | -8% | 76µs | -1% |
| 16384 | 761µs | 687µs | -9% | 730µs | -4% |
| 65536 | 3290µs | 3000µs | -8% | 3195µs | -2% |
| 250k | 14ms | 13ms | -4% | 13.9ms | -0.7% |
| 1kk | 62ms | 57ms | -8% | 60ms | -3% |
| 5kk | 332ms | 325ms | -2% | 322ms | -3% |
| 10kk | 737ms | 684ms | -7% | 716ms | -3% |
Parallel case
| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 250k | 15ms | 18ms | +20% | 15.2ms | +1% |
| 500k | 26ms | 31ms | +19% | 25.8ms | -1% |
| 2kk | 60ms | 69ms | +15% | 58.6ms | -2% |
| 8kk | 141ms | 156ms | +10% | 133ms | -6% |
| 16kk | 210ms | 235ms | +11% | 192ms | -9% |
| 64kk | 642ms | 748ms | +16% | 599ms | -7% |
| 100kk | 990ms | 1126ms | +13% | 901ms | -9% |
Some details about how I benchmark
Disclaimer: I am very new to benchmarking, so please let me know if I did something wrong.
I run SLURM tasks in a Singularity container with Rust 1.88 installed,
requesting 32 GB of memory and exclusive mode (`#SBATCH --exclusive`).
Framework
I use Criterion; see the benchmarking code for full details, but in short:
- array sizes from 512 to 10kk for sequential, and from 250k to 100kk for parallel
- use default criterion config with sample_size=20 and measurement_time=50s
- fresh random array for every iteration
- do not include random input generation time in measurement
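The last two points (fresh input per iteration, generation excluded from the measurement) are what Criterion's `iter_batched` is for; here is a dependency-free sketch of the same idea, using only std (the PRNG and sizes are illustrative, not the benchmark's actual parameters):

```rust
use std::time::{Duration, Instant};

/// Generate a pseudo-random input array with a tiny xorshift PRNG,
/// so the example needs no external crates. `seed` must be nonzero.
fn gen_input(n: usize, seed: &mut u64) -> Vec<i32> {
    (0..n)
        .map(|_| {
            *seed ^= *seed << 13;
            *seed ^= *seed >> 7;
            *seed ^= *seed << 17;
            (*seed % 1_000_000) as i32
        })
        .collect()
}

fn main() {
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    let mut total = Duration::ZERO;
    let iters: u32 = 20;
    for _ in 0..iters {
        // Fresh input every iteration, built OUTSIDE the timed region.
        let mut v = gen_input(4096, &mut seed);
        let t = Instant::now();
        v.sort(); // stand-in for the merge sort under test
        total += t.elapsed(); // only the sort itself is timed
        assert!(v.windows(2).all(|w| w[0] <= w[1]));
    }
    println!("avg per iter: {:?}", total / iters);
}
```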
Machine
Ubuntu 20.04.6 LTS
> uname -a
Linux xcs-vnode-1 5.4.0-208-generic #228-Ubuntu SMP Fri Feb 7 19:41:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- CPUs: 60
- Threads per core: 1
- CPU MHz: 2095.062
- NUMA node0 CPUs: 0-59
> systemd-detect-virt
kvm
Notes
I am not trying to write a super-fast parallel merge sort; I was just curious why different versions of the same algorithm perform this way.
The original project, if you are interested, came from trying to write a formally verified parallel merge sort in Verus; you can see it here.