Are vectorization failures due to Rust or LLVM?
⚓ Rust 📅 2025-11-19 👤 surdeus 👁️ 11

Here's a simplified piece of extremely performance-sensitive decompression code I would like to use:
```rust
pub unsafe fn decompress_offsets(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &mut [u64],
) {
    for (&offset_bits, (&offset_bits_csum, latent)) in offset_bits_scratch
        .iter()
        .zip(offset_bits_csum_scratch.iter().zip(latents.iter_mut()))
    {
        let bit_idx = base_bit_idx as u32 + offset_bits_csum;
        let byte_idx = bit_idx / 8;
        let bits_past_byte = bit_idx % 8;
        *latent = read_u64_at(src, byte_idx as usize, bits_past_byte, offset_bits)
            .wrapping_add(*latent);
    }
}

#[inline]
unsafe fn read_u64_at(src: &[u8], byte_idx: usize, bits_past_byte: u32, n: u32) -> u64 {
    debug_assert!(n <= 57);
    let raw_bytes = *(src.as_ptr().add(byte_idx) as *const [u8; 8]);
    let value = u64::from_le_bytes(raw_bytes);
    (value >> bits_past_byte) & ((1 << n) - 1)
}
```
This vectorizes on x64 but fails to do so on aarch64. I can get some very similar loops to vectorize if I

1. remove the final wrapping add, or
2. write to another `dst: &mut [u64]` buffer instead of working in-place.

However, I would rather not do either of those for performance reasons, and in reality I have several generic versions of this loop, so I can't easily write inline assembly.
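For concreteness, the write-to-a-separate-buffer workaround might look like the sketch below. The names `decompress_offsets_to_dst` and `read_u64_at_safe` are my own, and the bounds-checked read stands in for the original unsafe 8-byte load so the sketch is safe and self-contained:

```rust
// Hypothetical variant of the loop that writes into a separate `dst`
// buffer instead of updating `latents` in place. Because `latents` is
// only read and `dst` is only written, there is no read-modify-write
// dependence through memory in this version.
fn read_u64_at_safe(src: &[u8], byte_idx: usize, bits_past_byte: u32, n: u32) -> u64 {
    debug_assert!(n <= 57);
    // Bounds-checked stand-in for the unaligned unsafe read.
    let raw_bytes: [u8; 8] = src[byte_idx..byte_idx + 8].try_into().unwrap();
    let value = u64::from_le_bytes(raw_bytes);
    (value >> bits_past_byte) & ((1u64 << n) - 1)
}

pub fn decompress_offsets_to_dst(
    base_bit_idx: usize,
    src: &[u8],
    offset_bits_csum_scratch: &[u32],
    offset_bits_scratch: &[u32],
    latents: &[u64],
    dst: &mut [u64],
) {
    for (((&offset_bits, &offset_bits_csum), &latent), out) in offset_bits_scratch
        .iter()
        .zip(offset_bits_csum_scratch.iter())
        .zip(latents.iter())
        .zip(dst.iter_mut())
    {
        let bit_idx = base_bit_idx as u32 + offset_bits_csum;
        let byte_idx = (bit_idx / 8) as usize;
        let bits_past_byte = bit_idx % 8;
        *out = read_u64_at_safe(src, byte_idx, bits_past_byte, offset_bits).wrapping_add(latent);
    }
}
```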
Things I've tried:
- Looked at the LLVM IR. The vectorizing versions have a `vector.body` section, but I'm not sure whether rustc produces that or LLVM does, since I may just be looking at IR after all the optimization passes have run.
- Looked at the assembly on both platforms. It appears to me that what I want is definitely possible by tweaking the assembly from variant (1.) above.
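One way to narrow this down: as far as I know, rustc itself performs no auto-vectorization, so a `vector.body` block is created by LLVM's LoopVectorize pass, and you can confirm that by comparing the IR rustc hands to LLVM against the post-optimization IR, and by asking the vectorizer for remarks. A sketch of the invocations (`lib.rs` is a placeholder path; flags are standard rustc codegen options):

```shell
# Emit the IR as rustc produces it, before LLVM's optimization pipeline
# runs. If vector.body is absent here but present in the default
# --emit=llvm-ir output, LLVM created it, not rustc.
rustc -O --emit=llvm-ir -C no-prepopulate-passes lib.rs

# Ask LLVM's loop vectorizer to report why it succeeded or bailed out;
# debuginfo gives the remarks source locations.
rustc -O --emit=llvm-ir -C remark=loop-vectorize -C debuginfo=1 lib.rs
```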
So how can I tell if a vectorization failure is due to Rust or LLVM? If the former, how can we improve the compiler in this case? Are there any good workarounds for the moment?