Loop unrolling in manually written loops vs iterators (AVX512)
⚓ Rust 📅 2026-04-09 👤 surdeus 👁️ 4
Hi all! I'm currently working on a data format for delta processing on columnar data. Simply put: there is an original database column (&[u64]) and a "delta column" which encodes just the changes needed to transform it into an alternative version (think data versioning).
Central to this is writing efficient AVX512 kernels for processing such delta columns. For example, a sum kernel would run over the original data, and on-the-fly select the original/delta version of the data based on bitmasks.
On a high level this looks something like this (plus some extra complexity in the delta storage):
let mut accumulator = simd::u64x8::from([0; 8]);
for (vector_i, original_vector) in original_column.chunks_exact(8).enumerate() {
    let delta_payload = ...;
    let delta_mask = ...;
    let mixed_vector = Simd::load_select(
        original_vector,
        delta_mask,
        &delta_payload,
    );
    accumulator += mixed_vector;
}
let result = accumulator.reduce_sum();
In benchmarks this performs very well, depending on the number of deltas to apply. Since Rust's iterators are supposed to be zero-cost, I expected to be able to simply move this logic into an iterator and then trivially implement any kernel over my Iterator<Item = simd::u64x8>. The sum logic then becomes just: delta_column.sum::<simd::u64x8>().reduce_sum(). However, performance got worse with the iterator approach, which seems to be caused by the compiler no longer managing to unroll the loop.
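For reference, the iterator wrapper would look roughly like the sketch below. All names here are hypothetical, and each "lane" is a plain u64 standing in for a simd::u64x8 vector (with a bool slice standing in for the bitmask), so the structure compiles on stable Rust without the nightly std::simd API:

```rust
// Hypothetical scalar stand-in for the delta-column iterator: u64 replaces
// simd::u64x8 and a bool-per-element slice replaces the AVX512 bitmask.
struct DeltaIter<'a> {
    original: &'a [u64],
    delta_payload: &'a [u64],
    delta_mask: &'a [bool],
    i: usize,
}

impl<'a> Iterator for DeltaIter<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        let i = self.i;
        if i >= self.original.len() {
            return None;
        }
        self.i += 1;
        // Select the delta version where the mask is set, mirroring what
        // Simd::load_select does in the vectorized kernel.
        Some(if self.delta_mask[i] {
            self.delta_payload[i]
        } else {
            self.original[i]
        })
    }
}

fn main() {
    let original = [1u64, 2, 3, 4];
    let payload = [10u64, 20, 30, 40];
    let mask = [false, true, false, true];
    let it = DeltaIter {
        original: &original,
        delta_payload: &payload,
        delta_mask: &mask,
        i: 0,
    };
    let sum: u64 = it.sum(); // picks 1, 20, 3, 40
    println!("{sum}"); // 64
}
```

With this shape, any reduction kernel is just a standard iterator adapter (`sum`, `fold`, …) over the mixed stream.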
Checking on Godbolt (Compiler Explorer), the compiler uses 4 zmm registers at once for the manual loop:
vmovdqu64 zmm1, zmmword ptr [rbx + r12]
vmovdqu64 zmm1 {k1}, zmmword ptr [r15 + 8*rax - 64]
kmovb k0, byte ptr [r14 + r10 - 3]
kmovb k1, byte ptr [r14 + r10 - 2]
lea r12, [r15 + 8*rax - 192]
knotb k1, k1
shl r8, 6
vmovdqu64 zmm2, zmmword ptr [rbx + r8]
vmovdqu64 zmm2 {k1}, zmmword ptr [r15 + 8*rax - 128]
knotb k1, k0
shl rbp, 6
vmovdqu64 zmm3, zmmword ptr [rbx + rbp]
vmovdqu64 zmm3 {k1}, zmmword ptr [r12]
vpaddq zmm0, zmm3, zmm0
vpaddq zmm0, zmm2, zmm0
vpaddq zmm0, zmm1, zmm0
shl rdi, 6
vmovdqu64 zmm1, zmmword ptr [rbx + rdi]
kmovb k0, byte ptr [r14 + r10]
knotb k1, k0
vmovdqu64 zmm1 {k1}, zmmword ptr [r15 + 8*rax]
vpaddq zmm0, zmm1, zmm0
while the iterator approach uses just one:
vpaddq zmm1, zmm1, zmmword ptr [rsp + 128]
vmovdqa64 zmmword ptr [rsp + 192], zmm1
mov rdi, rbx
mov rsi, r14
vzeroupper
call r15
vmovdqa64 zmm1, zmmword ptr [rsp + 192]
cmp byte ptr [rsp + 64], 0
jne .LBB1_3
vextracti64x4 ymm0, zmm1, 1
vpaddq zmm0, zmm1, zmm0
I'm guessing this is just LLVM getting stuck in a local maximum, since there are more translation steps between the iterator and the raw loop (the indirect call r15 in the listing suggests part of the iterator chain isn't even being inlined).
I'd love to have some advice on:
- Is this a common issue for iterators?
- Are there tricks to nudge LLVM in the right direction here?
- Does anyone know a nice pattern for manual loop unrolling within an iterator (e.g. by keeping a temporary buffer of 4×512-bit vectors)?
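On the last question, one pattern that can be sketched independently of the SIMD details is pulling four items per trip through the loop and keeping four independent accumulators, which breaks the serial add dependency chain much like the compiler's 4-register unrolled loop above. A minimal sketch, again with u64 standing in for simd::u64x8 and `unrolled_sum` as a hypothetical name:

```rust
// Hedged sketch: manual 4-way unrolling over any Iterator<Item = u64>.
// Four independent accumulators mirror the 4 zmm registers the compiler
// uses in the unrolled manual loop.
fn unrolled_sum(it: impl Iterator<Item = u64>) -> u64 {
    // fuse() guarantees that once next() returns None it stays None,
    // so the tail handling below cannot drop elements.
    let mut it = it.fuse();
    let mut acc = [0u64; 4];
    loop {
        match (it.next(), it.next(), it.next(), it.next()) {
            (Some(a), Some(b), Some(c), Some(d)) => {
                acc[0] = acc[0].wrapping_add(a);
                acc[1] = acc[1].wrapping_add(b);
                acc[2] = acc[2].wrapping_add(c);
                acc[3] = acc[3].wrapping_add(d);
            }
            (a, b, c, _) => {
                // Fewer than 4 items left: fold the stragglers and stop.
                for x in [a, b, c].into_iter().flatten() {
                    acc[0] = acc[0].wrapping_add(x);
                }
                break;
            }
        }
    }
    acc[0]
        .wrapping_add(acc[1])
        .wrapping_add(acc[2])
        .wrapping_add(acc[3])
}

fn main() {
    println!("{}", unrolled_sum(1..=10u64)); // 55
}
```

Whether LLVM keeps the four accumulators in separate zmm registers for the real SIMD version would still need checking on Godbolt; this only shows the shape of the unrolling.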
Thanks a lot in advance!