Unexpected parallel sort performance


surdeus

Hello Rust community!

I would love some thoughts on why a safe parallel merge sort performs better than an unsafe implementation.

I compare three simple parallel merge sorts (all implementations and the benchmark can be found here; a simplified sketch follows the list):

  1. Slices, which passes the two halves of the array to parallel tasks using `split_at_mut`
  2. Slices Unchecked, which does the same thing but uses unchecked operations
  3. Single Array, which operates on a `*mut i32` and therefore does not need any splitting
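
Roughly, the three variants have the following shapes (a simplified sketch written for this post; the actual benchmarked code is in the linked repository):

```rust
/// Standard two-way merge of data[..mid] and data[mid..], using buf as scratch.
fn merge(data: &mut [i32], mid: usize, buf: &mut [i32]) {
    let n = data.len();
    buf[..n].copy_from_slice(data);
    let (mut i, mut j) = (0, mid);
    for slot in data.iter_mut() {
        if j == n || (i < mid && buf[i] <= buf[j]) {
            *slot = buf[i];
            i += 1;
        } else {
            *slot = buf[j];
            j += 1;
        }
    }
}

/// 1. Slices: the borrow checker is satisfied via split_at_mut.
fn merge_sort_slices(data: &mut [i32], buf: &mut [i32]) {
    if data.len() <= 1 {
        return;
    }
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    let (lbuf, rbuf) = buf.split_at_mut(mid);
    merge_sort_slices(left, lbuf);
    merge_sort_slices(right, rbuf);
    merge(data, mid, buf);
}

// 2. Slices Unchecked: identical structure, but the merge loop indexes with
//    get_unchecked / get_unchecked_mut inside unsafe blocks, so no bounds
//    checks are emitted.

/// Raw-pointer merge of [lo, mid) and [mid, hi).
/// Safety: ptr and buf must be valid for indices lo..hi, with no concurrent
/// access to that range.
unsafe fn merge_raw(ptr: *mut i32, lo: usize, mid: usize, hi: usize, buf: *mut i32) {
    std::ptr::copy_nonoverlapping(ptr.add(lo), buf.add(lo), hi - lo);
    let (mut i, mut j) = (lo, mid);
    for k in lo..hi {
        if j == hi || (i < mid && *buf.add(i) <= *buf.add(j)) {
            *ptr.add(k) = *buf.add(i);
            i += 1;
        } else {
            *ptr.add(k) = *buf.add(j);
            j += 1;
        }
    }
}

/// 3. Single Array: the recursion carries (ptr, lo, hi), so no splitting is
/// needed. Safety: same contract as merge_raw.
unsafe fn merge_sort_raw(ptr: *mut i32, lo: usize, hi: usize, buf: *mut i32) {
    if hi - lo <= 1 {
        return;
    }
    let mid = lo + (hi - lo) / 2;
    merge_sort_raw(ptr, lo, mid, buf);
    merge_sort_raw(ptr, mid, hi, buf);
    merge_raw(ptr, lo, mid, hi, buf);
}
```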

What I expect:
In both the parallel and the sequential case, Slices is slower than both Slices Unchecked and Single Array.
What I observed after benchmarking:

  • In the sequential case, Slices Unchecked was 2% faster and Single Array was 6.6% faster
  • In the parallel case, Slices Unchecked was 4.7% faster, but Single Array was 14.8% SLOWER!

Why does the Single Array method perform worse than Slices when I switch to parallel execution? Single Array doesn't do any index bounds checking, and neither does Slices Unchecked, so I would assume the two would perform similarly, as they do in the sequential case. What could be the reasons?
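
For context, the parallel dispatch has roughly this shape in both styles (a sketch building on the one above; `rayon::join` and the `CUTOFF` value are illustrative stand-ins, not necessarily what the benchmarked code uses):

```rust
// Builds on the sequential sketch above; requires the rayon crate.
const CUTOFF: usize = 4096; // illustrative: below this, fall back to sequential

fn par_merge_sort_slices(data: &mut [i32], buf: &mut [i32]) {
    if data.len() <= CUTOFF {
        merge_sort_slices(data, buf);
        return;
    }
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    let (lbuf, rbuf) = buf.split_at_mut(mid);
    rayon::join(
        || par_merge_sort_slices(left, lbuf),
        || par_merge_sort_slices(right, rbuf),
    );
    merge(data, mid, buf);
}

// The raw-pointer version shares a single *mut i32 between both tasks, so
// the pointer needs a wrapper before it may cross the thread boundary.
#[derive(Clone, Copy)]
struct SendPtr(*mut i32);
unsafe impl Send for SendPtr {}

/// Safety: the caller guarantees exclusive access to ptr[lo..hi]; the two
/// recursive calls work on disjoint halves, so they never alias.
fn par_merge_sort_raw(ptr: SendPtr, lo: usize, hi: usize, buf: SendPtr) {
    if hi - lo <= CUTOFF {
        unsafe { merge_sort_raw(ptr.0, lo, hi, buf.0) };
        return;
    }
    let mid = lo + (hi - lo) / 2;
    rayon::join(
        move || par_merge_sort_raw(ptr, lo, mid, buf),
        move || par_merge_sort_raw(ptr, mid, hi, buf),
    );
    unsafe { merge_raw(ptr.0, lo, mid, hi, buf.0) };
}
```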

More detailed benchmarking results

Sequential case

| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 512 | 16µs | 15µs | -7% | 16µs | 0% |
| 2048 | 77µs | 70µs | -8% | 76µs | -1% |
| 16384 | 761µs | 687µs | -9% | 730µs | -4% |
| 65536 | 3290µs | 3000µs | -8% | 3195µs | -2% |
| 250k | 14ms | 13ms | -4% | 13.9ms | -0.7% |
| 1kk | 62ms | 57ms | -8% | 60ms | -3% |
| 5kk | 332ms | 325ms | -2% | 322ms | -3% |
| 10kk | 737ms | 684ms | -7% | 716ms | -3% |

Parallel case

| Array Size | Slices (baseline) | Single Array | Diff | Slices Unchecked | Diff |
|---|---|---|---|---|---|
| 250k | 15ms | 18ms | +20% | 15.2ms | +1% |
| 500k | 26ms | 31ms | +19% | 25.8ms | -1% |
| 2kk | 60ms | 69ms | +15% | 58.6ms | -2% |
| 8kk | 141ms | 156ms | +10% | 133ms | -6% |
| 16kk | 210ms | 235ms | +11% | 192ms | -9% |
| 64kk | 642ms | 748ms | +16% | 599ms | -7% |
| 100kk | 990ms | 1126ms | +13% | 901ms | -9% |

Some details about how I benchmark

Disclaimer: I am very new to benchmarking, so please let me know if I did something wrong 🙂

I run SLURM tasks in a Singularity container with Rust 1.88 installed,
requesting 32G of memory and exclusive mode (#SBATCH --exclusive). A sketch of the job script is below.
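
The job script is essentially the following (the image name and time limit are illustrative placeholders, not copied from my actual script):

```bash
#!/bin/bash
#SBATCH --exclusive       # whole node to ourselves
#SBATCH --mem=32G         # 32G of memory
#SBATCH --time=04:00:00   # illustrative time limit

# Run the Criterion benchmarks inside the Singularity container
# (rust-1.88.sif is an illustrative image name).
singularity exec rust-1.88.sif cargo bench
```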

Framework

I use Criterion; see the benchmarking code for full details, but in short (a minimal sketch follows this list):

  • array sizes from 512 to 10kk for sequential, and from 250k to 100kk for parallel
  • use the default Criterion config with sample_size=20 and measurement_time=50s
  • generate a fresh random array for every iteration
  • do not include random input generation time in the measurement
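
The setup looks roughly like this (a trimmed-down sketch; the group name, the size list, and the use of rand are illustrative, and `par_merge_sort_slices` refers to the sketch earlier in the post):

```rust
// Trimmed-down Criterion setup; requires criterion and rand as
// dev-dependencies. Names and sizes here are illustrative.
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
use rand::Rng;
use std::time::Duration;

fn bench_sorts(c: &mut Criterion) {
    let mut group = c.benchmark_group("parallel_merge_sort");
    group.sample_size(20);
    group.measurement_time(Duration::from_secs(50));

    for &n in &[250_000usize, 2_000_000, 100_000_000] {
        group.bench_function(format!("slices/{n}"), |b| {
            b.iter_batched(
                // Setup: a fresh random array per iteration, not timed.
                || {
                    let mut rng = rand::thread_rng();
                    (0..n).map(|_| rng.gen::<i32>()).collect::<Vec<i32>>()
                },
                // Routine: only this closure is measured.
                |mut v| {
                    let mut buf = vec![0i32; v.len()];
                    par_merge_sort_slices(&mut v, &mut buf);
                },
                BatchSize::LargeInput,
            )
        });
    }
    group.finish();
}

criterion_group!(benches, bench_sorts);
criterion_main!(benches);
```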

Machine

Ubuntu 20.04.6 LTS

> uname -a
Linux xcs-vnode-1 5.4.0-208-generic #228-Ubuntu SMP Fri Feb 7 19:41:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • CPUs: 60
  • Threads per core: 1
  • CPU MHz: 2095.062
  • NUMA node0 CPUs: 0-59
> systemd-detect-virt 
kvm

Notes

I am not trying to write a super-fast parallel merge sort; I was just curious why different versions of the same algorithm perform this way.
The original project, if you are interested, came from an attempt to write a formally verified parallel merge sort in Verus; you can see it here.
