Improving the performance of this splitting code

⚓ Rust · 📅 2025-10-04 · 👤 surdeus · 👁️ 18


I'm porting an application from Python to Rust. It had a regex in it: `([{}\[\]<>|=&'#*;:/\\\"\-!\n])`. Hand-rolled, that splitting translates to the following Rust code:

pub fn split_wikitext(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut last_end = 0;
    // The 20 delimiter characters from the original regex class.
    const SPLITS: [char; 20] = [
        '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':',
        '/', '\\', '"', '-', '!', '\n',
    ];
    for (i, ch) in text.char_indices() {
        if SPLITS.contains(&ch) {
            // Emit the text since the previous delimiter, unless it's empty.
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            // Emit the delimiter itself as its own slice.
            let ch_end = i + ch.len_utf8();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    // Emit any trailing text after the last delimiter.
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}
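For concreteness, here's the splitting behavior on a few short inputs (my own examples, not from the profiled workload). Delimiters come back as their own one-character slices, and the `last_end != i` guard already suppresses empty strings between adjacent delimiters:

```rust
// The function from above, reproduced so this snippet compiles standalone.
pub fn split_wikitext(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut last_end = 0;
    const SPLITS: [char; 20] = [
        '{', '}', '[', ']', '<', '>', '|', '=', '&', '\'', '#', '*', ';', ':',
        '/', '\\', '"', '-', '!', '\n',
    ];
    for (i, ch) in text.char_indices() {
        if SPLITS.contains(&ch) {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            let ch_end = i + ch.len_utf8();
            pieces.push(&text[i..ch_end]);
            last_end = ch_end;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}

fn main() {
    // Delimiters are kept, each as its own slice.
    assert_eq!(split_wikitext("a|b=c"), ["a", "|", "b", "=", "c"]);
    // Runs of delimiters produce no empty strings between them...
    assert_eq!(split_wikitext("[[x]]"), ["[", "[", "x", "]", "]"]);
    // ...and neither do leading or trailing delimiters.
    assert_eq!(split_wikitext("|x|"), ["|", "x", "|"]);
}
```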

Python has slightly different regex capture semantics than the `regex` crate. I also don't care whether empty strings appear in the output `Vec`, although I'd prefer not to need a second pass to eliminate them. Profiling shows this function is one of the hotspots in my code: 216 samples, about 10% of execution time, of which 11 samples are spent in `RawVec::grow_one`, probably from all the pushing. The main cost is that I'm throwing massive strings at this function and it has to go through every character one at a time. Is there a more optimal way to do this?
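One observation: every delimiter in the set is a single ASCII byte, so the scan can work on raw bytes with a 256-entry lookup table instead of decoding chars via `char_indices`. Slicing at those byte offsets is still safe, because an ASCII byte can never appear inside a multi-byte UTF-8 sequence. Here is a sketch along those lines (the function name and the `text.len() / 8` preallocation guess are mine, and worth tuning against the real workload):

```rust
/// Byte-oriented variant: all 20 delimiters are ASCII, so we can scan
/// raw bytes and skip per-char UTF-8 decoding entirely.
pub fn split_wikitext_bytes(text: &str) -> Vec<&str> {
    // 256-entry table: IS_SPLIT[b] is true exactly for delimiter bytes.
    const IS_SPLIT: [bool; 256] = {
        let mut t = [false; 256];
        let splits = b"{}[]<>|=&'#*;:/\\\"-!\n";
        let mut i = 0;
        while i < splits.len() {
            t[splits[i] as usize] = true;
            i += 1;
        }
        t
    };
    // Rough preallocation to cut down on Vec regrowth (RawVec::grow_one).
    let mut pieces = Vec::with_capacity(text.len() / 8);
    let mut last_end = 0;
    for (i, &b) in text.as_bytes().iter().enumerate() {
        if IS_SPLIT[b as usize] {
            if last_end != i {
                pieces.push(&text[last_end..i]);
            }
            // A delimiter is always exactly one byte here.
            pieces.push(&text[i..i + 1]);
            last_end = i + 1;
        }
    }
    if last_end < text.len() {
        pieces.push(&text[last_end..]);
    }
    pieces
}

fn main() {
    assert_eq!(split_wikitext_bytes("a|b=c"), ["a", "|", "b", "=", "c"]);
    // Multi-byte characters pass through untouched.
    assert_eq!(split_wikitext_bytes("héllo|wörld"), ["héllo", "|", "wörld"]);
}
```

For truly huge inputs, a SIMD-accelerated search (e.g. the `memchr` or `aho-corasick` crates) might beat the per-byte table lookup, though neither targets exactly this "20 single bytes" case out of the box; benchmarking would be needed to say.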

Anyhow, I'm wondering if I'm doing this in the most optimal manner.

8 posts - 3 participants
