Looking for feedback on my first tokenizer in Rust

⚓ Rust    📅 2026-04-19    👤 surdeus    👁️ 4      

Hey everyone, I've been trying to learn about building programming languages and thought I'd make it doubly hard for myself by also learning Rust at the same time :sweat_smile:

I've been learning about the lexing/tokenizing process, so I'm starting by building a small program that can handle arithmetic expressions.

I'll share the tokenizer code here (the file is around 60 LOC) in case anyone has feedback, along with some of my own notes.

#[derive(Debug)]
enum TokenType {
    INT,
    ADD,
    SUB,
    MUL,
    DIV,
    LPAREN,
    RPAREN,
}

#[derive(Debug)]
struct Token {
    token_type: TokenType,
    token_value: String,
}

fn main() {
    let tokens = tokenize("42 + 123");
    println!("{:?}", tokens);
}

fn tokenize(expr: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = vec![];
    let mut current_number = String::new();

    for c in expr.chars() {
        if c.is_ascii_digit() {
            current_number.push(c);
        } else if !current_number.is_empty() {
            tokens.push(build_token(TokenType::INT, current_number.clone()));
            current_number.clear();
        }

        match c {
            ' ' => continue,
            '(' => tokens.push(build_token(TokenType::LPAREN, String::from("("))),
            ')' => tokens.push(build_token(TokenType::RPAREN, String::from(")"))),
            '+' => tokens.push(build_token(TokenType::ADD, String::from("+"))),
            '-' => tokens.push(build_token(TokenType::SUB, String::from("-"))),
            '*' => tokens.push(build_token(TokenType::MUL, String::from("*"))),
            '/' => tokens.push(build_token(TokenType::DIV, String::from("/"))),
            _ => continue,
        }
    }

    if !current_number.is_empty() {
        tokens.push(build_token(TokenType::INT, current_number));
    }

    tokens
}

fn build_token(token_type: TokenType, token_value: String) -> Token {
    Token {
        token_type,
        token_value,
    }
}

Some things I've been thinking about where it could be improved:

  • Not entirely sure I need the enum for the types, or rather maybe I should use something else here?
  • The tokenizer itself feels a little messy with both the if and the match being used, but I couldn't think of another way to handle multi-digit numbers.
  • I'm parsing everything as a string which I think makes sense.
  • Thinking about if I was to extend this with other tokens like keywords, variables, etc then this feels like it could get real messy real fast. Are lexers/tokenizers really implemented this way with large match statements? Maybe there's some refactoring I could do somehow?
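On the last two bullets, one alternative shape I've been toying with is driving the loop with a Peekable iterator, so that each match arm owns a whole token rule (the digit arm keeps consuming digits itself), and letting the enum variant carry the parsed value instead of a separate String field. Rough sketch only, with illustrative names, so happy for feedback on this version too:

```rust
// Sketch of a Peekable-based lexer; the Token shape here differs from
// my code above in that Int carries the parsed value directly.
#[derive(Debug, PartialEq)]
enum Token {
    Int(i64),
    Add,
    Sub,
    Mul,
    Div,
    LParen,
    RParen,
}

fn tokenize(expr: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = expr.chars().peekable();

    while let Some(&c) = chars.peek() {
        match c {
            // This arm owns the whole "number" rule: keep consuming digits
            // until we peek at something that isn't one.
            '0'..='9' => {
                let mut n: i64 = 0;
                while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                    n = n * 10 + i64::from(d);
                    chars.next();
                }
                tokens.push(Token::Int(n));
            }
            '+' => { chars.next(); tokens.push(Token::Add); }
            '-' => { chars.next(); tokens.push(Token::Sub); }
            '*' => { chars.next(); tokens.push(Token::Mul); }
            '/' => { chars.next(); tokens.push(Token::Div); }
            '(' => { chars.next(); tokens.push(Token::LParen); }
            ')' => { chars.next(); tokens.push(Token::RParen); }
            _ => { chars.next(); } // skip whitespace / unknown chars for now
        }
    }
    tokens
}

fn main() {
    // prints [Int(42), Add, LParen, Int(3), Mul, Int(7), RParen]
    println!("{:?}", tokenize("42 + (3 * 7)"));
}
```

The idea is that adding keywords/identifiers later would just mean one more arm (e.g. an alphabetic arm that consumes a whole word), rather than more special-case state outside the match.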

As mentioned, I'm very new to Rust, so right now I'm focusing less on the intricacies of the language and its memory model and more on working out the best way to express the tokenizer logic. But I also understand that the better I understand Rust, the more idiomatically I'll likely be able to write it.

If anyone manages to take a look at this post, thank you!
