Improving the performance of this splitting code
⚓ Rust 📅 2025-10-04 👤 surdeus 👁️ 7

I'm porting an application from Python to Rust. It had some regex in it: `([{}\[\]<>|=&'#*;:/\\\"\-!\n])`. This converts to the following Rust code when hand-rolled:
pub fn split_wikitext(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut last_end = 0;
    // The same delimiter set as the regex character class above.
    const SPLITS: [char; 20] = [
        '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':',
        '/', '\\', '"', '-', '!', '\n',
    ];
    for (i, ch) in text.char_indices() {
        if SPLITS.contains(&ch) {
            // Push the text since the previous delimiter, if non-empty.
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            // Push the delimiter itself, matching Python's capturing-group split.
            let ch_end = i + ch.len_utf8();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    // Any trailing remainder after the last delimiter.
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}
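Since every delimiter in SPLITS is ASCII, one direction worth trying (a sketch of my own, not the author's code; the function name `split_wikitext_bytes` and the capacity heuristic are assumptions) is to scan raw bytes with a 256-entry lookup table instead of decoding chars, and pre-reserve the vector to cut down on regrowth. Multi-byte UTF-8 sequences never contain ASCII bytes, so slicing at these positions stays on char boundaries:

```rust
/// Byte-table variant of split_wikitext: all 20 delimiters are ASCII,
/// so we can scan bytes directly. (Sketch; benchmark against the real workload.)
pub fn split_wikitext_bytes(text: &str) -> Vec<&str> {
    // Build a 256-entry membership table at compile time.
    const fn make_table() -> [bool; 256] {
        let mut t = [false; 256];
        let splits: &[u8] = b"{}[]<>|=&'#*;:/\\\"-!\n";
        let mut i = 0;
        while i < splits.len() {
            t[splits[i] as usize] = true;
            i += 1;
        }
        t
    }
    const TABLE: [bool; 256] = make_table();

    let bytes = text.as_bytes();
    // Rough heuristic to reduce RawVec::grow_one calls; tune for your data.
    let mut pieces = Vec::with_capacity(bytes.len() / 8);
    let mut last_end = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if TABLE[b as usize] {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            // The delimiter is a single ASCII byte, so this slice is valid UTF-8.
            pieces.push(&text[i..i + 1]);
            last_end = i + 1;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}
```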
Python has slightly different regex capture semantics than the regex crate. Also, I don't care whether empty strings appear in the output vec, although I'd prefer not having to do a pass to eliminate them. Profiling shows that this function is one of the hotspots in my code: 216 samples, about 10% of execution time, are spent in it, and 11 samples are spent in RawVec::grow_one, probably from all the pushing. The main cost is that I'm throwing massive strings at this function and it has to go through every character one at a time. Is there a more optimal way to do this?
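On the RawVec::grow_one point specifically: if downstream consumers can work with an iterator, one way to avoid building the Vec at all is to yield pieces lazily. A hedged sketch under that assumption (`split_wikitext_iter` is my own name; same delimiter logic as the original function):

```rust
/// Lazy variant: yields &str pieces on demand, allocating no Vec.
/// A delimiter found mid-scan may need to wait while the preceding
/// text piece is yielded first, hence the `pending` slot.
pub fn split_wikitext_iter<'a>(text: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    const SPLITS: [char; 20] = [
        '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':',
        '/', '\\', '"', '-', '!', '\n',
    ];
    let mut iter = text.char_indices();
    let mut last_end = 0;
    let mut pending: Option<&'a str> = None;
    std::iter::from_fn(move || {
        // A delimiter held over from the previous call goes out first.
        if let Some(p) = pending.take() {
            return Some(p);
        }
        while let Some((i, ch)) = iter.next() {
            if SPLITS.contains(&ch) {
                let ch_end = i + ch.len_utf8();
                let delim = &text[i..ch_end];
                let before = (last_end != i).then(|| &text[last_end..i]);
                last_end = ch_end;
                return match before {
                    Some(b) => {
                        pending = Some(delim);
                        Some(b)
                    }
                    None => Some(delim),
                };
            }
        }
        // Trailing remainder, yielded exactly once.
        if last_end < text.len() {
            let tail = &text[last_end..];
            last_end = text.len();
            Some(tail)
        } else {
            None
        }
    })
}
```

Whether this wins depends on what the caller does with the pieces; if it ultimately collects into a Vec anyway, the byte-table approach with a pre-reserved Vec is probably the better bet.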
Anyhow, I'm wondering whether I'm doing this in the most optimal manner.
8 posts - 3 participants