Preserve the String object from File::lines, but only pass a slice on

⚓ rust    📅 2025-06-07    👤 surdeus    👁️ 3      

surdeus

Hi,

So this may very well be a recurring topic, and I apologize in advance if it has been answered many, many times before... It feels like a problem others must have stumbled upon, and maybe I should have asked before implementing my own solution from scratch 🙂

So here's the thing: I'm parsing a file with a format that is heavily line-based, and I have written a deserialization function that reads an iterator of lines, treats them as string slices, and builds a structure that may contain smaller slices of those lines. Obviously, this means that the built structure has a lifetime related to the lifetime of the input data - and therein lies the problem.

If I read the full contents of the file into memory, I have a String object that I own, and I can pass str::lines to my deserialization function. However, if I want to read the file line by line, e.g. from a subprocess or from a decompression function or if the file is simply a bit too large, then there seems to be a problem:

  • BufRead::lines and similar methods return Strings - quite understandably! There is no pre-owned storage to point at!
  • if I use those as iterators, and I pass that String as a slice to my deserialization function, the compiler obviously complains that the slice refers to a String that will be dropped very, very soon
  • ...unless I collect all the Strings into an array and keep it until I'm done, but that kind of defeats the purpose of not reading the full file's contents into memory!
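
A minimal sketch of the situation described above, assuming a hypothetical `parse_fields` function as a stand-in for the real deserialization function (the name, the `key: value` format, and the helper functions are illustrations only, not taken from the actual project). It shows why the in-memory case works, and why the streaming case ends up keeping every `String` alive:

```rust
use std::io::{self, BufRead};

/// Hypothetical stand-in for the deserialization function: it keeps slices
/// of the input lines, so its output borrows from the input for `'a`.
fn parse_fields<'a>(lines: impl Iterator<Item = &'a str>) -> Vec<(&'a str, &'a str)> {
    lines.filter_map(|line| line.split_once(": ")).collect()
}

/// Whole input owned up front: `str::lines` borrows from `contents`,
/// so the parsed fields may live as long as `contents` does.
fn parse_in_memory(contents: &str) -> Vec<(&str, &str)> {
    parse_fields(contents.lines())
}

/// Streaming input: `BufRead::lines` yields a fresh `String` per line, so the
/// only way to hand out `&str` slices is to keep all of them in `storage`.
fn parse_streaming<'a>(
    reader: impl BufRead,
    storage: &'a mut Vec<String>,
) -> io::Result<Vec<(&'a str, &'a str)>> {
    // This would not compile: each String is dropped when the closure
    // returns, while the slice is supposed to outlive it.
    // parse_fields(reader.lines().map(|line| line.unwrap().as_str()));

    // The workaround from the last bullet: collect every line first.
    *storage = reader.lines().collect::<io::Result<Vec<String>>>()?;
    Ok(parse_fields(storage.iter().map(String::as_str)))
}
```

`parse_streaming` compiles, but it holds the whole input in memory, which is exactly what reading line by line was meant to avoid.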

So I guess what I'm asking for is, is there a way to preserve the contents of the String, but let the deserialization function use it as a slice? My first thought was to use Arc or something similar, but whatever I do in a function that handles successive lines and provides them to the iterator, any objects I create there will be dropped very, very soon, just as the String itself.

So my solution was to go off and implement a new trait that provides a very small subset of the str methods - just as much as I need for this particular project - and proxies them to bytes::Bytes internal storage that is, hopefully, always a valid UTF-8-encoded string. Of course, I'm aware of a couple of serious drawbacks of this method:

  • a type implementing this trait cannot simply implement Deref<Target = str>, since the str methods are defined to return &str slices, which would lose the ownership; I have to proxy them into returning Self instead
  • this means that I have to implement as many of the str methods as I - or others - may possibly want to use at some future point, and even though the implementations are usually trivial, it still feels sort of wrong
  • there are some str methods that I simply cannot proxy in the same way, e.g. ones using the std::str::pattern::Pattern trait, or at least I cannot implement them in a no-std library, since core::str::pattern is still experimental 🙂
  • there will need to be all kinds of #[cfg(..)] shenanigans related to different Rust versions in the future; right now I'm really happy that str::from_utf8_unchecked() is stable in 1.87, and that's what I'm targeting
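
The general idea of "str methods that return Self" can be sketched as follows. This is only an illustration under stated assumptions: it uses `Arc<str>` from the standard library instead of `bytes::Bytes`, and the `OwnedSlice` type, its methods, and `parse_field` are hypothetical names, not the API of the str-of-bytes library.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical owned "slice": shared storage plus a byte range into it.
#[derive(Clone, Debug)]
struct OwnedSlice {
    storage: Arc<str>,
    range: Range<usize>,
}

impl OwnedSlice {
    /// Take ownership of a whole line.
    fn new(line: String) -> Self {
        let storage: Arc<str> = Arc::from(line);
        let range = 0..storage.len();
        Self { storage, range }
    }

    /// Borrowed view of the part of the storage this value covers.
    fn as_str(&self) -> &str {
        &self.storage[self.range.clone()]
    }

    /// Sub-slice by a range relative to `self`, sharing the same storage.
    fn slice(&self, rel: Range<usize>) -> Self {
        // Indexing panics on out-of-range or non-char-boundary offsets,
        // just like indexing a `str` does.
        let sub = &self.as_str()[rel.clone()];
        let start = self.range.start + rel.start;
        Self {
            storage: Arc::clone(&self.storage),
            range: start..start + sub.len(),
        }
    }

    /// Proxy for `str::split_once` (string delimiters only) returning `Self`.
    fn split_once(&self, delimiter: &str) -> Option<(Self, Self)> {
        let text = self.as_str();
        let at = text.find(delimiter)?;
        Some((self.slice(0..at), self.slice(at + delimiter.len()..text.len())))
    }

    /// Proxy for `str::strip_prefix` (string prefixes only) returning `Self`.
    fn strip_prefix(&self, prefix: &str) -> Option<Self> {
        let text = self.as_str();
        text.starts_with(prefix)
            .then(|| self.slice(prefix.len()..text.len()))
    }
}

/// The line is consumed here, yet the returned pieces stay valid because
/// each of them holds a reference count on the shared storage.
fn parse_field(line: String) -> Option<(OwnedSlice, OwnedSlice)> {
    OwnedSlice::new(line).split_once(": ")
}
```

Every proxied method has to be written by hand like `split_once` and `strip_prefix` here, which is the maintenance cost the bullets above describe.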

All that said, what I have so far is the StrLike and AdoptableStr traits and the StrOfBytes implementation in my - still unreleased - str-of-bytes library; for the moment it is only used in another still unreleased module, facet-deb822. But the main point of this post is to ask what I have missed: is there already an implementation of something like that - a struct that behaves as much like str as possible, but retains ownership of the data? As pointed out above, implementations that only provide a Deref are not enough, since the ownership will be dropped as soon as the deserialization library invokes .split_once() or .strip_prefix() or something like that.
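
To illustrate why `Deref` alone is not enough, here is a small sketch with a hypothetical `SharedLine` wrapper (the type and function names are assumptions made for this example). Methods such as `.split_once()` are reached through `Deref<Target = str>`, so their results are plain `&str` slices that borrow from the wrapper and carry no ownership of their own:

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Hypothetical owning wrapper that only offers `Deref<Target = str>`.
struct SharedLine(Arc<str>);

impl Deref for SharedLine {
    type Target = str;

    fn deref(&self) -> &str {
        &self.0
    }
}

/// `split_once` is called on the `str` behind the wrapper, so the returned
/// key is a `&str` that cannot outlive the borrow of `line`.
fn key_of(line: &SharedLine) -> Option<&str> {
    line.split_once(": ").map(|(key, _)| key)
}

/// A function that owns the line cannot return a slice of it at all;
/// it has to copy the piece into a new `String`.
fn key_of_owned(line: SharedLine) -> Option<String> {
    // Would not compile: the result would reference data owned by `line`,
    // which is dropped at the end of this function.
    // line.split_once(": ").map(|(key, _)| key)
    line.split_once(": ").map(|(key, _)| key.to_owned())
}
```

The reference count inside `SharedLine` never reaches the pieces, because `str::split_once` knows nothing about the wrapper it was called through.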

Of course, "oh come on, you're looking at it totally the wrong way! here's a much better way to do what you really need" answers will also be welcome 🙂 And yes, I know how to use parser combinators, but I think that at least nom has the same issue - a string slice in the result has no knowledge of the memory storage it refers to.

Thanks in advance for any insights!

2 posts - 2 participants


🏷️ rust_feed