Missing utf8 parsing features in the standard library

⚓ Rust 📅 2025-08-05 👤 surdeus 👁️ 11

Info

This post is auto-generated from RSS feed The Rust Programming Language Forum - Latest topics. Source: Missing utf8 parsing features in the standard library

I recently had to write a byte slice to a fmt::Write sink. The bytes should be UTF-8, and if they are not, a lossy conversion should be performed. I didn't want to allocate memory for this operation, so String::from_utf8_lossy, which allocates whenever the input contains invalid bytes, was not an option. I ended up with something like this (comments omitted), which is an almost one-to-one copy of the example shown in the documentation of std::str::Utf8Error:

use std::fmt::{self, Write};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct UTF8LossyWriter<'a>(&'a [u8]);

impl fmt::Display for UTF8LossyWriter<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut data = self.0;
        loop {
            match std::str::from_utf8(data) {
                Ok(s) => {
                    f.write_str(s)?;
                    break;
                }
                Err(e) => {
                    let (valid, after_valid) = data.split_at(e.valid_up_to());
                    let valid_str = unsafe { std::str::from_utf8_unchecked(valid) };
                    f.write_str(valid_str)?;
                    f.write_char(char::REPLACEMENT_CHARACTER)?;

                    if let Some(invalid_sequence_length) = e.error_len() {
                        data = &after_valid[invalid_sequence_length..];
                    } else {
                        break;
                    }
                }
            }
        }

        Ok(())
    }
}

This code is pretty straightforward, but it has one caveat (which I will argue is a problem). To avoid re-parsing the valid UTF-8 data in the Err branch, I have to call the unsafe function from_utf8_unchecked. The call is sound as long as the input is split at valid_up_to(), but it still requires the unsafe keyword. I was quite surprised that I could not find a function in the standard library that abstracts over this kind of operation (parsing chunks of bytes and returning valid string sub-slices). There are a couple of reasons why I find this problematic:

Some people are allergic to unsafe. Part of the Rust community will do anything to avoid writing unsafe code. This is partly based on a somewhat irrational fear (maybe that is too strong a word, I don't know how to put it better) that you shouldn't use unsafe unless you absolutely have to. Some people will reject this implementation during code review with feedback like: "I know that this is faster, but let's just call from_utf8 again, just in case."
Some codebases use #[deny(unsafe_code)] (or even forbid, which cannot be overridden at all). Writing #[allow(unsafe_code)] for particular functions sort of defeats the purpose of enabling those lints.
This increases the cognitive burden of the code. Even though the unsafe block is quite small, and even if it is properly documented with SAFETY comments, one can still worry about what will happen when the surrounding code changes in the future.
Discoverability of the Utf8Error::valid_up_to + str::from_utf8_unchecked combination is quite low. If you know the standard library by heart, this won't be a problem. But if you have only just begun learning Rust, you will probably find str::from_utf8, whose documentation does mention str::from_utf8_unchecked but says nothing about the additional information provided by Utf8Error.
Lastly, this feels like something that should be provided by the standard library. It is full of safe abstractions over unsafe operations, and I believe this one should be present too. For example, I think these two problems:
1. I have a &mut [T] and want to split it at index i into two mutable slices.
2. I have a &[u8] and want to parse it as UTF-8 for as long as I can, and handle all invalid bytes.
have something in common. They both describe a concrete logical operation that is impossible to implement (efficiently, or at all) without unsafe code. Having the standard library take on this burden benefits all users, who get a simple safe API that handles this one case and lets them stop worrying about implementation details.
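
To make the "just call from_utf8 again" review feedback concrete, here is a minimal sketch of the fully safe variant. The name SafeLossy is made up for this illustration. It behaves like the writer above, but every valid prefix is validated twice, which is exactly the cost the unsafe call avoids.

```rust
use std::fmt::{self, Write};

/// Hypothetical safe counterpart of `UTF8LossyWriter` (the name is an assumption).
struct SafeLossy<'a>(&'a [u8]);

impl fmt::Display for SafeLossy<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut data = self.0;
        loop {
            match std::str::from_utf8(data) {
                // The whole remainder is valid: write it and stop.
                Ok(s) => return f.write_str(s),
                Err(e) => {
                    let (valid, rest) = data.split_at(e.valid_up_to());
                    // Safe but redundant: this prefix was already validated
                    // by the `from_utf8` call above and is scanned again here.
                    let valid_str = std::str::from_utf8(valid).map_err(|_| fmt::Error)?;
                    f.write_str(valid_str)?;
                    f.write_char(char::REPLACEMENT_CHARACTER)?;
                    match e.error_len() {
                        // Skip the invalid sequence and keep going.
                        Some(len) => data = &rest[len..],
                        // Input ended in the middle of a sequence.
                        None => return Ok(()),
                    }
                }
            }
        }
    }
}
```

With this, `SafeLossy(b"Hello \xF0\x90\x80World").to_string()` gives the same result as the unsafe version; the difference is only in how much work is done.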

After my initial surprise at not finding this in the standard library, I started to think about the issue and realised that it is actually not obvious what the API of such a function should look like. It has to have the following properties:

parse as much data as possible and return &str
when it encounters invalid bytes it has to somehow report how many of them there are
must be flexible enough to be useful in different scenarios
must not be slower than manually calling from_utf8_unchecked

Additionally, some things can be done in multiple ways, so it is an open question how it should handle:

empty input slices (return empty slice back, or None)
advancing input slice (done by caller or callee)
returning data (as slices or as indexes)
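
To illustrate the "slices or indexes" question, here is a rough sketch of what an index-based result could look like. Split and split_lengths are hypothetical names, not existing APIs. It is essentially a repackaging of what Utf8Error already reports.

```rust
/// Hypothetical index-based result: lengths instead of sub-slices.
#[derive(Debug, PartialEq, Eq)]
struct Split {
    /// Number of leading bytes that form valid UTF-8.
    valid_len: usize,
    /// Number of invalid bytes directly after the valid prefix (0 if none).
    invalid_len: usize,
}

/// Hypothetical helper that reports only lengths for the start of `input`.
fn split_lengths(input: &[u8]) -> Split {
    match std::str::from_utf8(input) {
        Ok(s) => Split { valid_len: s.len(), invalid_len: 0 },
        Err(e) => {
            let valid_len = e.valid_up_to();
            // `error_len() == None` means the input ends in the middle of a
            // sequence; in this sketch the whole tail counts as invalid.
            let invalid_len = e.error_len().unwrap_or(input.len() - valid_len);
            Split { valid_len, invalid_len }
        }
    }
}
```

Note that indexes alone do not solve the original problem: the caller still has to turn `&input[..valid_len]` into a &str, which again means either validating it a second time or reaching for from_utf8_unchecked. That seems like an argument for returning slices.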

Here are a couple of examples of how this could be done that I thought of:

enum Chunk<'a> {
    Str(&'a str),
    Invalid(&'a [u8]),
    /// Possibly another variant for signalling end of input.
    /// Only for the `from_utf8` and `from_utf8_winnow_style` functions;
    /// alternatively, they could return `Option<Chunk>`.
    Finished,
}

/// User must call this again with input advanced
fn from_utf8(_: &[u8]) -> Chunk<'_> {
    todo!()
}

/// This function handles advancement and "consumes" whole slice
fn from_utf8_iter(_: &[u8]) -> impl Iterator<Item = Chunk<'_>> {
    todo!()
}

/// This function handles advancement
/// and must be called again with the same reference
fn from_utf8_winnow_style<'a>(_: &mut &'a [u8]) -> Chunk<'a> {
    todo!()
}
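
For what it's worth, the winnow-style signature can be filled in on top of the existing std::str::from_utf8. The following is only a sketch under my own assumptions: it uses the Option<Chunk> alternative instead of a Finished variant, and next_chunk and chunks are hypothetical names. The single unsafe call ends up hidden behind a safe function, and the iterator shape falls out of it almost for free.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Chunk<'a> {
    Str(&'a str),
    Invalid(&'a [u8]),
}

/// Hypothetical winnow-style parser: advances `input` past the returned chunk.
/// Returns `None` once the input is exhausted.
fn next_chunk<'a>(input: &mut &'a [u8]) -> Option<Chunk<'a>> {
    let data: &'a [u8] = *input;
    if data.is_empty() {
        return None;
    }
    match std::str::from_utf8(data) {
        Ok(s) => {
            // Everything that is left is valid.
            *input = &data[data.len()..];
            Some(Chunk::Str(s))
        }
        Err(e) if e.valid_up_to() > 0 => {
            let (valid, rest) = data.split_at(e.valid_up_to());
            *input = rest;
            // SAFETY: `valid_up_to()` is the length of the longest prefix of
            // `data` that is valid UTF-8, so `valid` is valid UTF-8.
            Some(Chunk::Str(unsafe { std::str::from_utf8_unchecked(valid) }))
        }
        Err(e) => {
            // The input starts with an invalid sequence. `None` from
            // `error_len()` means it was cut off by the end of the input.
            let len = e.error_len().unwrap_or(data.len());
            let (invalid, rest) = data.split_at(len);
            *input = rest;
            Some(Chunk::Invalid(invalid))
        }
    }
}

/// Hypothetical iterator-style wrapper built on top of `next_chunk`.
fn chunks<'a>(mut input: &'a [u8]) -> impl Iterator<Item = Chunk<'a>> {
    std::iter::from_fn(move || next_chunk(&mut input))
}
```

In this sketch, empty input yields None immediately, and valid and invalid runs are always reported as separate chunks. Since from_utf8 stops at the first error, the total work should stay linear in the input length.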

I am curious to know what you think about this topic. Would you also want such functionality in the standard library? Do you know why it is not there? Was there ever a discussion about this? Is this a good place to discuss adding such features to the standard library, or should this be moved somewhere else, like IRLO, Zulip or GitHub?
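
One possibly relevant data point: recent toolchains ship <[u8]>::utf8_chunks (stabilized in Rust 1.79, to the best of my knowledge), whose iterator shape looks close to the from_utf8_iter idea above. Each item exposes valid() as a &str and invalid() as a &[u8]. A minimal sketch, assuming a toolchain where it is available (LossyDisplay is a made-up name, only utf8_chunks, valid and invalid come from std):

```rust
use std::fmt::{self, Write};

/// Hypothetical wrapper; relies on `<[u8]>::utf8_chunks` being available.
struct LossyDisplay<'a>(&'a [u8]);

impl fmt::Display for LossyDisplay<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for chunk in self.0.utf8_chunks() {
            // `valid()` is already typed as `&str`, so no unsafe is needed here.
            f.write_str(chunk.valid())?;
            // A non-empty `invalid()` marks an invalid or truncated sequence.
            if !chunk.invalid().is_empty() {
                f.write_char(char::REPLACEMENT_CHARACTER)?;
            }
        }
        Ok(())
    }
}
```

If that API covers the use case, the remaining questions would mostly be about discoverability and about the non-iterator shapes listed above.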
