Unexpected parallel sort performance


surdeus

Hello Rust community!

I would love some thoughts on why a safe parallel merge sort performs better than an unsafe implementation.

I compare three simple parallel merge sorts (all implementations and the benchmark can be found here; a simplified sketch follows the list):

  1. Slices, which passes the two halves of the array to parallel tasks using `split_at_mut`
  2. Slices Unchecked, which does the same thing but uses unchecked operations
  3. Single Array, which operates on a `*mut i32` and therefore does not need any splitting
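
Roughly, the three variants have the following shapes (a simplified sketch written for this post; the actual benchmarked code is in the linked repository):

```rust
/// Standard two-way merge of data[..mid] and data[mid..], using buf as scratch.
fn merge(data: &mut [i32], mid: usize, buf: &mut [i32]) {
    let n = data.len();
    buf[..n].copy_from_slice(data);
    let (mut i, mut j) = (0, mid);
    for slot in data.iter_mut() {
        if j == n || (i < mid && buf[i] <= buf[j]) {
            *slot = buf[i];
            i += 1;
        } else {
            *slot = buf[j];
            j += 1;
        }
    }
}

/// 1. Slices: the borrow checker is satisfied via split_at_mut.
fn merge_sort_slices(data: &mut [i32], buf: &mut [i32]) {
    if data.len() <= 1 {
        return;
    }
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    let (lbuf, rbuf) = buf.split_at_mut(mid);
    merge_sort_slices(left, lbuf);
    merge_sort_slices(right, rbuf);
    merge(data, mid, buf);
}

// 2. Slices Unchecked: identical structure, but the merge loop indexes with
//    get_unchecked / get_unchecked_mut inside unsafe blocks, so no bounds
//    checks are emitted.

/// Raw-pointer merge of [lo, mid) and [mid, hi).
/// Safety: ptr and buf must be valid for indices lo..hi, with no concurrent
/// access to that range.
unsafe fn merge_raw(ptr: *mut i32, lo: usize, mid: usize, hi: usize, buf: *mut i32) {
    std::ptr::copy_nonoverlapping(ptr.add(lo), buf.add(lo), hi - lo);
    let (mut i, mut j) = (lo, mid);
    for k in lo..hi {
        if j == hi || (i < mid && *buf.add(i) <= *buf.add(j)) {
            *ptr.add(k) = *buf.add(i);
            i += 1;
        } else {
            *ptr.add(k) = *buf.add(j);
            j += 1;
        }
    }
}

/// 3. Single Array: the recursion carries (ptr, lo, hi), so no splitting is
/// needed. Safety: same contract as merge_raw.
unsafe fn merge_sort_raw(ptr: *mut i32, lo: usize, hi: usize, buf: *mut i32) {
    if hi - lo <= 1 {
        return;
    }
    let mid = lo + (hi - lo) / 2;
    merge_sort_raw(ptr, lo, mid, buf);
    merge_sort_raw(ptr, mid, hi, buf);
    merge_raw(ptr, lo, mid, hi, buf);
}
```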

What I expect:
In both the parallel and the sequential case, Slices is slower than both Slices Unchecked and Single Array.
What I observed after benchmarking:

  • In the sequential case, Slices Unchecked was 2% faster and Single Array was 6.6% faster
  • In the parallel case, Slices Unchecked was 4.7% faster, but Single Array was 14.8% SLOWER!

Why does the Single Array method perform worse than Slices when I switch to parallel execution? Single Array doesn't do any index bounds checking, and neither does Slices Unchecked, so I would assume the two would perform similarly, as they do in the sequential case. What could be the reasons?
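
For context, the parallel dispatch has roughly this shape in both styles (a sketch building on the one above; `rayon::join` and the `CUTOFF` value are illustrative stand-ins, not necessarily what the benchmarked code uses):

```rust
// Builds on the sequential sketch above; requires the rayon crate.
const CUTOFF: usize = 4096; // illustrative: below this, fall back to sequential

fn par_merge_sort_slices(data: &mut [i32], buf: &mut [i32]) {
    if data.len() <= CUTOFF {
        merge_sort_slices(data, buf);
        return;
    }
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    let (lbuf, rbuf) = buf.split_at_mut(mid);
    rayon::join(
        || par_merge_sort_slices(left, lbuf),
        || par_merge_sort_slices(right, rbuf),
    );
    merge(data, mid, buf);
}

// The raw-pointer version shares a single *mut i32 between both tasks, so
// the pointer needs a wrapper before it may cross the thread boundary.
#[derive(Clone, Copy)]
struct SendPtr(*mut i32);
unsafe impl Send for SendPtr {}

/// Safety: the caller guarantees exclusive access to ptr[lo..hi]; the two
/// recursive calls work on disjoint halves, so they never alias.
fn par_merge_sort_raw(ptr: SendPtr, lo: usize, hi: usize, buf: SendPtr) {
    if hi - lo <= CUTOFF {
        unsafe { merge_sort_raw(ptr.0, lo, hi, buf.0) };
        return;
    }
    let mid = lo + (hi - lo) / 2;
    rayon::join(
        move || par_merge_sort_raw(ptr, lo, mid, buf),
        move || par_merge_sort_raw(ptr, mid, hi, buf),
    );
    unsafe { merge_raw(ptr.0, lo, mid, hi, buf.0) };
}
```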

More detailed benchmarking results

Sequential case

| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 512 | 16µs | 15µs | -7% | 16µs | 0% |
| 2048 | 77µs | 70µs | -8% | 76µs | -1% |
| 16384 | 761µs | 687µs | -9% | 730µs | -4% |
| 65536 | 3290µs | 3000µs | -8% | 3195µs | -2% |
| 250k | 14ms | 13ms | -4% | 13.9ms | -0.7% |
| 1kk | 62ms | 57ms | -8% | 60ms | -3% |
| 5kk | 332ms | 325ms | -2% | 322ms | -3% |
| 10kk | 737ms | 684ms | -7% | 716ms | -3% |

Parallel case

| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 250k | 15ms | 18ms | +20% | 15.2ms | +1% |
| 500k | 26ms | 31ms | +19% | 25.8ms | -1% |
| 2kk | 60ms | 69ms | +15% | 58.6ms | -2% |
| 8kk | 141ms | 156ms | +10% | 133ms | -6% |
| 16kk | 210ms | 235ms | +11% | 192ms | -9% |
| 64kk | 642ms | 748ms | +16% | 599ms | -7% |
| 100kk | 990ms | 1126ms | +13% | 901ms | -9% |

Some details about how I benchmark

Disclaimer: I am very new to benchmarking, so please let me know if I did something wrong 🙂

I run SLURM tasks in a Singularity container with Rust 1.88 installed,
requesting 32G of memory and exclusive mode (#SBATCH --exclusive). A sketch of the job script is below.
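
The job script is essentially the following (the image name and time limit are illustrative placeholders, not copied from my actual script):

```bash
#!/bin/bash
#SBATCH --exclusive       # whole node to ourselves
#SBATCH --mem=32G         # 32G of memory
#SBATCH --time=04:00:00   # illustrative time limit

# Run the Criterion benchmarks inside the Singularity container
# (rust-1.88.sif is an illustrative image name).
singularity exec rust-1.88.sif cargo bench
```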

Framework

I use Criterion; see the benchmarking code for full details, but in short (a minimal sketch follows this list):

  • array sizes from 512 to 10kk for sequential, and from 250k to 100kk for parallel
  • use the default Criterion config with sample_size=20 and measurement_time=50s
  • generate a fresh random array for every iteration
  • do not include random input generation time in the measurement
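
The setup looks roughly like this (a trimmed-down sketch; the group name, the size list, and the use of rand are illustrative, and `par_merge_sort_slices` refers to the sketch earlier in the post):

```rust
// Trimmed-down Criterion setup; requires criterion and rand as
// dev-dependencies. Names and sizes here are illustrative.
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
use rand::Rng;
use std::time::Duration;

fn bench_sorts(c: &mut Criterion) {
    let mut group = c.benchmark_group("parallel_merge_sort");
    group.sample_size(20);
    group.measurement_time(Duration::from_secs(50));

    for &n in &[250_000usize, 2_000_000, 100_000_000] {
        group.bench_function(format!("slices/{n}"), |b| {
            b.iter_batched(
                // Setup: a fresh random array per iteration, not timed.
                || {
                    let mut rng = rand::thread_rng();
                    (0..n).map(|_| rng.gen::<i32>()).collect::<Vec<i32>>()
                },
                // Routine: only this closure is measured.
                |mut v| {
                    let mut buf = vec![0i32; v.len()];
                    par_merge_sort_slices(&mut v, &mut buf);
                },
                BatchSize::LargeInput,
            )
        });
    }
    group.finish();
}

criterion_group!(benches, bench_sorts);
criterion_main!(benches);
```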

Machine

Ubuntu 20.04.6 LTS

> uname -a
Linux xcs-vnode-1 5.4.0-208-generic #228-Ubuntu SMP Fri Feb 7 19:41:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • CPUs: 60
  • Threads per core: 1
  • CPU MHz: 2095.062
  • NUMA node0 CPUs: 0-59
> systemd-detect-virt 
kvm

Notes

I am not trying to write a super-fast parallel merge sort; I was just curious why different versions of the same algorithm perform this way.
The original project, if you are interested, came from an attempt to write a formally verified parallel merge sort in Verus; you can see it here.
