Loop unrolling in manually written loops vs iterators (AVX512)

⚓ Rust    📅 2026-04-09    👤 surdeus    👁️ 4      

surdeus

Hi all! I'm currently working on a data format for delta processing on columnar data. Simply put: there is an original database column &[u64] and a "delta column" which encodes just the changes needed to transform it into an alternative version (think data versioning).

Central to this is writing efficient AVX512 kernels for processing such delta columns. For example, a sum kernel would run over the original data and, based on bitmasks, select on the fly between the original and delta version of each vector.

On a high level this looks something like this (plus some extra complexity in the delta storage):

let mut accumulator = simd::u64x8::from([0; 8]);

for (vector_i, original_vector) in original_column.chunks_exact(8).enumerate() {
  let delta_payload = ...;
  let delta_mask = ...;
  let mixed_vector = Simd::load_select(
    original_vector,
    delta_mask,
    &delta_payload,
  );
  accumulator += mixed_vector;
}
let result = accumulator.reduce_sum();

In benchmarks this performs very well, depending on the number of deltas to apply. Since Rust's iterators are supposed to be zero-cost, I expected I could move this logic into an iterator and then implement any kernel over my Iterator<Item = simd::u64x8> very easily. The sum logic then becomes just: delta_column.sum::<simd::u64x8>().reduce_sum(). However, performance got worse with the iterator approach, apparently because the compiler no longer manages to unroll the loop.
Checking on Godbolt (Compiler Explorer), I can see the compiler using four zmm registers at once in the manual loop:

vmovdqu64       zmm1, zmmword ptr [rbx + r12]
vmovdqu64       zmm1 {k1}, zmmword ptr [r15 + 8*rax - 64]
kmovb   k0, byte ptr [r14 + r10 - 3]
kmovb   k1, byte ptr [r14 + r10 - 2]
lea     r12, [r15 + 8*rax - 192]
knotb   k1, k1
shl     r8, 6
vmovdqu64       zmm2, zmmword ptr [rbx + r8]
vmovdqu64       zmm2 {k1}, zmmword ptr [r15 + 8*rax - 128]
knotb   k1, k0
shl     rbp, 6
vmovdqu64       zmm3, zmmword ptr [rbx + rbp]
vmovdqu64       zmm3 {k1}, zmmword ptr [r12]
vpaddq  zmm0, zmm3, zmm0
vpaddq  zmm0, zmm2, zmm0
vpaddq  zmm0, zmm1, zmm0
shl     rdi, 6
vmovdqu64       zmm1, zmmword ptr [rbx + rdi]
kmovb   k0, byte ptr [r14 + r10]
knotb   k1, k0
vmovdqu64       zmm1 {k1}, zmmword ptr [r15 + 8*rax]
vpaddq  zmm0, zmm1, zmm0

while the iterator approach uses just one:

vpaddq  zmm1, zmm1, zmmword ptr [rsp + 128]
vmovdqa64       zmmword ptr [rsp + 192], zmm1
mov     rdi, rbx
mov     rsi, r14
vzeroupper
call    r15
vmovdqa64       zmm1, zmmword ptr [rsp + 192]
cmp     byte ptr [rsp + 64], 0
jne     .LBB1_3
vextracti64x4   ymm0, zmm1, 1
vpaddq  zmm0, zmm1, zmm0

I'm guessing this is just LLVM getting stuck in some local optimum, since there are more translation steps between the iterator and the raw loop.

I'd love to have some advice on:

  • Is this a common issue for iterators?
  • Are there tricks to nudge LLVM in the right direction here?
  • Does anyone know a nice pattern for manual loop unrolling within an iterator (e.g. by keeping a temporary buffer of 4×512-bit vectors)?

Thanks a lot in advance!

