Lifetime puzzle: Eliminate allocations for hashmap keys
⚓ Rust 📅 2026-03-21 👤 surdeus 👁️ 1

The following is a fragment of a larger program in which `intern_string` is used to stash 10,000–1,000,000 word-sized strings in the storage pool, after which the index is thrown away and the rest of the program works exclusively with the pool and the `Substr` instances that refer to it.
```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, Hash, Eq, PartialEq)]
pub struct Substr {
    pub start: usize,
    pub len: usize,
}

pub fn intern_string(
    storage: &mut String,
    index: &mut HashMap<String, Substr>,
    s: &str,
) -> Substr {
    if let Some(ss) = index.get(s) {
        return *ss;
    }
    let start = storage.find(s).unwrap_or(storage.len());
    if start == storage.len() {
        storage.push_str(s);
    }
    let ss = Substr { start, len: s.len() };
    if index.insert(s.to_owned(), ss).is_some() {
        panic!("should never get here due to get() above");
    }
    ss
}
```
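For concreteness, here is a minimal self-contained driver for the function above (the local names `pool` and `idx` are just for this sketch). It shows that duplicate words intern to the identical `Substr`, that the pool holds one copy of each unique word, and that the `find` step lets a word reuse bytes of an already-stored word that contains it:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, Hash, Eq, PartialEq)]
pub struct Substr {
    pub start: usize,
    pub len: usize,
}

pub fn intern_string(
    storage: &mut String,
    index: &mut HashMap<String, Substr>,
    s: &str,
) -> Substr {
    if let Some(ss) = index.get(s) {
        return *ss;
    }
    // Reuse an existing occurrence if s already appears somewhere in
    // the pool; otherwise append it at the end.
    let start = storage.find(s).unwrap_or(storage.len());
    if start == storage.len() {
        storage.push_str(s);
    }
    let ss = Substr { start, len: s.len() };
    index.insert(s.to_owned(), ss);
    ss
}

fn main() {
    let mut pool = String::new();
    let mut idx = HashMap::new();
    let a = intern_string(&mut pool, &mut idx, "cat");
    let b = intern_string(&mut pool, &mut idx, "dog");
    let c = intern_string(&mut pool, &mut idx, "cat");
    assert_eq!(a, c);            // duplicates intern to the same Substr
    assert_ne!(a, b);
    assert_eq!(&pool, "catdog"); // one copy of each unique word
    // "at" occurs inside "cat", so it reuses pool bytes 1..3
    // instead of growing the pool.
    let d = intern_string(&mut pool, &mut idx, "at");
    assert_eq!(d, Substr { start: 1, len: 2 });
    assert_eq!(&pool, "catdog");
}
```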
It seems like it ought to be possible to avoid allocating a second copy of each unique string for the hashmap keys, instead somehow referring to the copy in `storage`. But I can't figure out how. `index` cannot hold `&str`s pointing into `storage`, because they would be invalidated whenever `push_str` reallocates the buffer. And it cannot just be a `HashSet<Substr>`, because (AFAIK) looking up an `&str` in it would necessitate `Substr` having two different `Hash` and `Eq` impls, one of which would have to take an extra parameter (namely `storage`) that the traits don't allow.
The algorithm works if you remove `index` entirely and just look up all the strings in `storage`, but then it becomes quadratic in the number of words, and the input is expected to contain many duplicates of each word (which is why we're bothering to intern them in the first place). Replacing the entire mess with a `HashSet<String>` would make the rest of the program have quadratic time costs, for reasons I'd rather not get into.
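To make the constraint concrete, here is one std-only direction (my own sketch, not something from the thread): key the index by a precomputed 64-bit hash of the word instead of an owned `String`, store the `Substr` candidates for each hash in a small bucket, and resolve (rare) collisions by slicing back into the pool. The names `hash_str` and `get`, and the `HashMap<u64, Vec<Substr>>` layout, are assumptions of this sketch; the substring-reuse `find` from the original is dropped for brevity:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Substr {
    pub start: usize,
    pub len: usize,
}

impl Substr {
    /// Resolve this span against the pool it was interned into.
    pub fn get<'a>(&self, storage: &'a str) -> &'a str {
        &storage[self.start..self.start + self.len]
    }
}

fn hash_str(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

/// Intern without owning a second copy of the string: the map keys are
/// u64 hashes, and equality is checked by slicing into `storage`.
pub fn intern_string(
    storage: &mut String,
    index: &mut HashMap<u64, Vec<Substr>>,
    s: &str,
) -> Substr {
    let bucket = index.entry(hash_str(s)).or_default();
    // A bucket almost always holds one entry; a hash collision just
    // means a short linear scan comparing against the pool.
    if let Some(&ss) = bucket.iter().find(|ss| ss.get(storage) == s) {
        return ss;
    }
    let ss = Substr { start: storage.len(), len: s.len() };
    storage.push_str(s);
    bucket.push(ss);
    ss
}
```

This sidesteps the `Hash`/`Eq`-needing-`storage` problem by doing the hashing and comparison outside the trait machinery; per unique word the index then holds only a `u64` key and a `Substr`, not an owned `String`.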
What are my options for eliminating the extra allocations?
3 posts - 3 participants
🏷️ Rust_feed