Why does string not use CodeUnit and CodePoint types?

⚓ Rust    📅 2026-02-22    👤 surdeus    👁️ 1

surdeus

Asking out of curiosity, why does the String interface use Characters instead of Code Points, and Bytes instead of Code Units?
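
For reference, a minimal sketch of what std's str/String API exposes today: chars() yields char values (Unicode scalar values), while len(), bytes(), and as_bytes() are all in terms of the underlying UTF-8 code units. This is just std behavior shown for context, not a proposal.

```rust
fn main() {
    let s = "e\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // chars() iterates Unicode scalar values (char), one per code point here.
    assert_eq!(s.chars().count(), 2);

    // len() and as_bytes() are measured in UTF-8 code units (u8).
    assert_eq!(s.len(), 3); // 1 byte for 'e' + 2 bytes for U+0301
    assert_eq!(s.as_bytes(), &[0x65, 0xCC, 0x81]);
}
```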

My understanding is that a Character (also called a Glyph) is the visual element, and Code Points are the indices to those visual elements. Since the code point itself is not the visual part of the string, it seems that whatever does the rendering is concerned with Characters. This also makes me wonder about composite Characters, where two code points combine to form one Character (see: Combining character - Wikipedia). Would this not create confusion, since it is up to the renderer (terminal, GUI, etc.) to determine how to render specific combinations of Code Points?
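
To make the composite-character point concrete, here is a minimal sketch contrasting code points with user-perceived characters. It assumes the third-party unicode-segmentation crate for grapheme clustering, since std has no grapheme iterator.

```rust
// Assumes the third-party unicode-segmentation crate (not part of std).
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // 'a' + U+0301 COMBINING ACUTE ACCENT: usually rendered as one glyph.
    let s = "a\u{301}";

    // Two code points (Unicode scalar values) ...
    assert_eq!(s.chars().count(), 2);

    // ... but one user-perceived character (extended grapheme cluster).
    assert_eq!(s.graphemes(true).count(), 1);
}
```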

Similar commentary applies to using Bytes, Shorts, and Words. The term Code Unit indicates the smallest unit a code point can be stored in: for UTF-8 that is a byte (u8), for UTF-16 a short (u16), and for UTF-32 a word (u32). Note, however, that code units have specific validation requirements. Since we are all about safety and security, would it not make more sense to present interfaces that implement a CodeUnit trait and define types such as type UTF8CodeUnit = u8;?
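
A rough, hypothetical sketch of the kind of interface I mean; none of these names exist in std, and the trait contents are only illustrative:

```rust
// Hypothetical sketch only: neither CodeUnit nor the aliases below exist in std.
trait CodeUnit: Copy {
    /// Width of one code unit, in bits.
    const BITS: u32;
}

impl CodeUnit for u8  { const BITS: u32 = 8;  }
impl CodeUnit for u16 { const BITS: u32 = 16; }
impl CodeUnit for u32 { const BITS: u32 = 32; }

// One code point occupies 1..=4 UTF-8 units, 1..=2 UTF-16 units,
// and exactly one UTF-32 unit.
type UTF8CodeUnit = u8;
type UTF16CodeUnit = u16;
type UTF32CodeUnit = u32;

fn main() {
    assert_eq!(<UTF8CodeUnit as CodeUnit>::BITS, 8);
    assert_eq!(<UTF16CodeUnit as CodeUnit>::BITS, 16);
    assert_eq!(<UTF32CodeUnit as CodeUnit>::BITS, 32);
}
```

A validating wrapper over such a trait could then refuse code-unit sequences that are not well-formed for the encoding in question.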

3 posts - 3 participants


๐Ÿท๏ธ Rust_feed