Improving the performance of this splitting code
⚓ Rust 📅 2025-10-04 👤 surdeus 👁️ 18

I'm porting an application from Python to Rust. It had some regex in it: ([{}\[\]<>|=&'#*;:/\\\"\-!\n]). Hand rolled, this converts to the following Rust code:
pub fn split_wikitext(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut last_end = 0;
    const SPLITS: [char; 20] = [
        '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':',
        '/', '\\', '"', '-', '!', '\n',
    ];
    for (i, ch) in text.char_indices() {
        if SPLITS.contains(&ch) {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            let ch_end = i + ch.len_utf8();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}
Python has slightly different regex capture semantics than the regex crate. Also, I don't care about empty strings showing up in the output vec, although I'd prefer not having to do a pass to eliminate them. Profiling shows that this function is one of the hotspots in my code: 216 samples, about 10% of execution time, with 11 samples spent in RawVec::grow_one (probably from all the pushing). Mainly this is because I'm throwing massive strings at this function and it has to go through every character one at a time. Is there a more optimal way to do this?
Anyhow, I'm wondering if I'm doing this in the most optimal manner.
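One possible direction, sketched here under stated assumptions (the function name, the table builder, and the capacity guess are mine, not from the original post): all 20 delimiters are ASCII, so the scan can run over raw bytes with a 256-entry lookup table and skip UTF-8 decoding entirely. This is safe because UTF-8 continuation bytes are always ≥ 0x80 and can never equal an ASCII delimiter, so every matched byte offset is a valid char boundary.

```rust
// Build a 256-entry membership table at compile time.
const fn build_table() -> [bool; 256] {
    let mut t = [false; 256];
    let splits = b"{}[]<>|=&'#*;:/\\\"-!\n";
    let mut i = 0;
    while i < splits.len() {
        t[splits[i] as usize] = true;
        i += 1;
    }
    t
}

static IS_SPLIT: [bool; 256] = build_table();

pub fn split_wikitext_bytes(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    // Capacity is a guess; tuning it to typical input could cut the
    // RawVec::grow_one reallocation churn seen in the profile.
    let mut pieces = Vec::with_capacity(64);
    let mut last_end = 0;
    for (i, &b) in bytes.iter().enumerate() {
        // One table load per byte, no char decoding.
        if IS_SPLIT[b as usize] {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            // All delimiters are one byte wide, so no len_utf8 needed.
            pieces.push(&text[i..i + 1]);
            last_end = i + 1;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}

fn main() {
    let pieces = split_wikitext_bytes("a=b|c");
    assert_eq!(pieces, ["a", "=", "b", "|", "c"]);
}
```

This still visits every byte, but avoids the char decode and the 20-element linear `contains` per character. Going further would likely mean SIMD-style scanning (e.g. the memchr crate's searchers), which I haven't benchmarked here.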
8 posts - 3 participants
🏷️ Rust_feed