Lexical Analysis: Operators

2026-02-17

With the Token definition from the previous chapter in place, we can start implementing lexical analysis for operators.

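First, look at the IllegalCharErr exception, which signals that the lexer has met an unknown character. As mentioned earlier, a language has its own alphabet. In our implementation the alphabet consists of digits, the decimal point, and the operators. A letter is not in our alphabet, so when the lexer sees one it should raise IllegalCharErr.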

Next comes the implementation of Lexer. This time we work backwards and start from the test cases.

read_char() is the most basic and most important method of the lexer. Its logic is simple:

When the lexer is initialized, pos points at src_code[0].
Each call to read_char() moves pos to the next character and increases column by 1.
Once pos has reached len(src_code), however, it stops moving.

read_char() works together with cur_char() and peek_char().

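cur_char() returns the character that pos currently points at. Once pos has reached len(src_code), it returns \0.
peek_char() returns the next character. Once pos + 1 has reached len(src_code), it returns \0.
\0 is later converted into the EOF token.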

Take an example: suppose the input string is abc.

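On initialization, pos points at [0], that is a, and column is 1. cur_char() returns a and peek_char() returns b.
After one call to read_char(), pos points at [1], that is b, and column is 2. cur_char() returns b and peek_char() returns c.
After two calls to read_char(), pos points at [2], that is c, and column is 3. cur_char() returns c and peek_char() returns \0.
After three calls to read_char(), pos points at [3], one past the last character, and column is 4. cur_char() returns \0 and peek_char() returns \0.
After four calls to read_char(), pos still points at [3] and column is still 4. cur_char() returns \0 and peek_char() returns \0.
Because pos already reaches [3], the end of the string, on the third call, calling read_char() 3 times, 4 times, or 100 times gives the same result.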
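
The walkthrough above can be reproduced with a small stand-alone sketch. The Cursor class and the trace() helper below are hypothetical names introduced only for illustration (they are not the chapter's Lexer); the sketch assumes nothing beyond the cursor rules just described.

```python
class Cursor:
    """Hypothetical stand-in that mimics only the cursor rules described above."""

    def __init__(self, src_code: str) -> None:
        self.src_code = src_code
        self.pos = 0     # index of the current character
        self.column = 1  # 1-based column of the current character

    def read_char(self) -> None:
        # Advance by one character, but never move past len(src_code).
        if self.pos < len(self.src_code):
            self.pos += 1
            self.column += 1

    @property
    def cur_char(self) -> str:
        return self.src_code[self.pos] if self.pos < len(self.src_code) else "\0"

    @property
    def peek_char(self) -> str:
        nxt = self.pos + 1
        return self.src_code[nxt] if nxt < len(self.src_code) else "\0"


def trace(src_code: str, times: int) -> tuple[int, int, str, str]:
    """Call read_char() `times` times and report (pos, column, cur, peek)."""
    cursor = Cursor(src_code)
    for _ in range(times):
        cursor.read_char()
    return cursor.pos, cursor.column, cursor.cur_char, cursor.peek_char


# Print the state after 0, 1, 2, 3, 4 and 100 calls on the input "abc".
for n in (0, 1, 2, 3, 4, 100):
    print(n, trace("abc", n))
```

The last three lines of output are identical, which is exactly the saturation behaviour the parametrized test checks with times set to 100.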

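The test test_lexer_read_char() below expresses this idea. times is the number of calls to read_char(). The test body creates a Lexer object and then calls read_char() times times in a for loop. With the test in place, we can change the implementation of read_char() however we like without touching the test, and the test result tells us whether the implementation is still correct.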

tests/test_lexer.py

@pytest.mark.parametrize(
    "src_code, times, pos, col, cur_char, peek_char",
    [
        ("abc", 0, 0, 1, "a", "b"),
        ("123", 1, 1, 2, "2", "3"),
        ("456", 2, 2, 3, "6", "\0"),
        ("789", 3, 3, 4, "\0", "\0"),
        ("", 0, 0, 1, "\0", "\0"),
        ("456", 100, 3, 4, "\0", "\0"),
    ],
)
def test_lexer_read_char(
    src_code: str,
    times: int,
    pos: int,
    col: int,
    cur_char: str,
    peek_char: str,
) -> None:
    lexer = Lexer(src_code)
    for _ in range(times):
        lexer.read_char()

    assert (
        lexer.cur_char == cur_char
    ), f"cur_char: expected='{cur_char}', actual='{lexer.cur_char}'"
    assert (
        lexer.peek_char == peek_char
    ), f"peek_char: expected='{peek_char}', actual='{lexer.peek_char}'"
    assert lexer.pos == pos, f"pos: expected={pos}, actual={lexer.pos}"
    assert lexer.column == col, f"col: expected={col}, actual={lexer.column}"

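With read_char() tested, we can turn to next_token(). Recall the lexer's input and output: the input is a string and the output is a stream of tokens. So far we have dealt with the input string; next_token() produces the tokens one at a time. test_lexer_next_token() tests the implementation of next_token(). The parameter list is long, so jump straight to the with expected_error: statement in the test body, which uses a context manager. Here it captures illegal input. As mentioned above, an illegal input such as a raises IllegalCharErr. To test erroneous input, we use the context manager to check that IllegalCharErr is actually raised; if it is not, the test fails. Inside the with block, the for loop iterates over the lexer. Each token comes with its index in the sequence, which we use to fetch the expected value from expected_tokens and compare it with the actual token. After the loop, we compare the lengths of the two sequences to check that every expected value has been used. You may have noticed that next_token() is never mentioned in this process. It is called inside the magic method __iter__(), which the for loop drives through enumerate().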

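Now for the test cases. Valid input raises no exception, so the expected_error slot holds does_not_raise(). For illegal input, pytest.raises(IllegalCharErr, match="...") captures the exception. match="..." is the expected exception message; note that it is matched as a regular expression. For example, match=r"^Illegal character: 'a' at line 1, col 1$" means we expect the message Illegal character: 'a' at line 1, col 1; without the trailing $, col 11 would also count as a match. Our tests cover only operators, whitespace, newlines, and the end-of-file marker. All other characters are outside the alphabet implemented so far, which means that digits and letters are both treated as illegal for now.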
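
The effect of the trailing $ can be checked directly with the re module, since pytest.raises applies the match pattern with re.search. This is a minimal sketch; the message string with col 11 is made up for the demonstration.

```python
import re

# A message that should NOT be accepted when we expect "col 1".
message = "Illegal character: 'a' at line 1, col 11"

anchored = r"^Illegal character: 'a' at line 1, col 1$"
unanchored = r"^Illegal character: 'a' at line 1, col 1"

# With $, the pattern must end right after "col 1", so "col 11" is rejected.
anchored_hit = re.search(anchored, message) is not None

# Without $, the pattern only has to match a prefix, so "col 11" slips through.
unanchored_hit = re.search(unanchored, message) is not None

print(anchored_hit, unanchored_hit)  # False True
```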

tests/test_lexer.py

@pytest.mark.parametrize(
    "src_code, expected_tokens, expected_error",
    [
        (
            "",
            [
                Token(TokenType.EOF, "\0", 1, 1),
            ],
            does_not_raise(),
        ),
        (
            " \t",
            [
                Token(TokenType.WHITESPACE, " ", 1, 1),
                Token(TokenType.WHITESPACE, "\t", 1, 2),
                Token(TokenType.EOF, "\0", 1, 3),
            ],
            does_not_raise(),
        ),
        (
            "\n \r\n",
            [
                Token(TokenType.NEWLINE, "\n", 1, 1),
                Token(TokenType.WHITESPACE, " ", 2, 1),
                Token(TokenType.NEWLINE, "\r\n", 2, 2),
                Token(TokenType.EOF, "\0", 3, 1),
            ],
            does_not_raise(),
        ),
        (
            "\n + \r\n",
            [
                Token(TokenType.NEWLINE, "\n", 1, 1),
                Token(TokenType.WHITESPACE, " ", 2, 1),
                Token(TokenType.PLUS, "+", 2, 2),
                Token(TokenType.WHITESPACE, " ", 2, 3),
                Token(TokenType.NEWLINE, "\r\n", 2, 4),
                Token(TokenType.EOF, "\0", 3, 1),
            ],
            does_not_raise(),
        ),
        (
            "a",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: 'a' at line 1, col 1$"
            ),
        ),
        (
            "abc",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: 'a' at line 1, col 1$"
            ),
        ),
        (
            "1",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '1' at line 1, col 1$"
            ),
        ),
        (
            "123",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '1' at line 1, col 1$"
            ),
        ),
        (
            "12.3",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '1' at line 1, col 1$"
            ),
        ),
        (
            "12.",  # Decimal accepts this as Decimal('12')
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '1' at line 1, col 1$"
            ),
        ),
        (
            "12.a",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '1' at line 1, col 1$"
            ),
        ),
        (
            "1 +6",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '1' at line 1, col 1$"
            ),
        ),
        (
            "+-*/^()",
            [
                Token(TokenType.PLUS, "+", 1, 1),
                Token(TokenType.MINUS, "-", 1, 2),
                Token(TokenType.ASTERISK, "*", 1, 3),
                Token(TokenType.SLASH, "/", 1, 4),
                Token(TokenType.CARET, "^", 1, 5),
                Token(TokenType.LPAREN, "(", 1, 6),
                Token(TokenType.RPAREN, ")", 1, 7),
                Token(TokenType.EOF, "\0", 1, 8),
            ],
            does_not_raise(),
        ),
        (
            "(1 + 2) * 3 - 4/ 5^6",
            [
                Token(TokenType.LPAREN, "(", 1, 1),
            ],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '1' at line 1, col 2$"
            ),
        ),
        (
            "3.2 / 5.3",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '3' at line 1, col 1$"
            ),
        ),
        (
            "3.2.3",
            [],
            pytest.raises(
                IllegalCharErr, match=r"^Illegal character: '3' at line 1, col 1$"
            ),
        ),
    ],
)
def test_lexer_next_token(
    src_code: str,
    expected_tokens: list[Token],
    expected_error: ContextManager[Any],  # typing.Any, not the builtin any()
) -> None:
    lexer = Lexer(src_code)

    with expected_error:
        for index, token in enumerate(lexer):
            expected_token = expected_tokens[index]
            assert (
                token == expected_token
            ), f"token: expected={repr(expected_token)}, got={repr(token)}"
        assert index + 1 == len(
            expected_tokens
        ), f"len: expected={len(expected_tokens)}, got={index + 1}"
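
The idea of passing the expected outcome as a context manager can be tried without pytest. In the sketch below, nullcontext plays the role of does_not_raise() (assuming, as is common, that does_not_raise is an alias for contextlib.nullcontext; the excerpt does not show its import). expect_raises() is a simplified, hypothetical stand-in for pytest.raises without match support, and parse_digit() is a toy function invented for the example.

```python
from contextlib import contextmanager, nullcontext
from typing import Iterator

DIGITS = set("0123456789")


@contextmanager
def expect_raises(exc_type: type[BaseException]) -> Iterator[None]:
    """Simplified, hypothetical stand-in for pytest.raises (no match= support)."""
    try:
        yield
    except exc_type:
        return  # the expected exception was raised: swallow it
    # Reaching this line means the body finished without raising.
    raise AssertionError(f"DID NOT RAISE {exc_type.__name__}")


def parse_digit(text: str) -> int:
    """Toy function under test: accepts one ASCII digit, rejects anything else."""
    if text not in DIGITS:
        raise ValueError(f"not a digit: {text!r}")
    return int(text)


cases = [
    ("7", nullcontext()),              # valid input: nothing should be raised
    ("x", expect_raises(ValueError)),  # invalid input: ValueError is expected
]

results = []
for text, expectation in cases:
    with expectation:  # same shape as `with expected_error:` in the test
        parse_digit(text)
    results.append(text)

print(results)  # ['7', 'x']
```

Both cases run through the same with statement, which is what lets a single test body serve valid and invalid inputs alike.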

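With the test cases understood, we can implement next_token(). It is essentially one large match statement. Each operator is a single character, so those branches are simple. Newlines are slightly more involved: we treat \r\n as a single newline as well, so we use peek_char. If the current character is \r, we peek at whether the next one is \n. If it is not, we return a token containing only \r; if it is, we return a \r\n token. Once the current character has been consumed, read_char() must be called to move on to the next one. Digits and letters fall through to the _ branch, which raises IllegalCharErr. In addition, a newline increments line by 1 and resets column; the code sets column to 0 so that the read_char() call at the end of next_token() brings it to 1. The __iter__() method mentioned earlier is the last method of the Lexer class. The for index, token in enumerate(lexer) in the test may be hard to follow. If we wrote for token in lexer instead, every iteration would call next_token() once and return one token. enumerate() merely attaches an index to each token in the sequence.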

src/pyec/lexer.py

class LexErr(Exception):
    def __init__(self, char: str, line: int, col: int) -> None:
        self.char = char
        self.line = line
        self.col = col

    def __repr__(self) -> str:
        raise NotImplementedError

    def __str__(self) -> str:
        return self.__repr__()


class IllegalCharErr(LexErr):
    def __repr__(self) -> str:
        return (
            f"Illegal character: {repr(self.char)} at line {self.line}, col {self.col}"
        )

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, LexErr):
            return False
        return (
            self.char == other.char
            and self.line == other.line
            and self.col == other.col
        )


class Lexer:
    def __init__(self, src_code: str) -> None:
        self.src_code: str = src_code  # source code
        self.pos: int = 0  # current position
        self.line: int = 1  # current line number
        self.column: int = 1  # current column number

    def read_char(self) -> None:
        # advance pos (and column) by one character; stop at the end of src_code
        if self.pos < len(self.src_code):
            self.pos += 1
            self.column += 1

    @property
    def cur_char(self) -> str:
        return "\0" if self.pos >= len(self.src_code) else self.src_code[self.pos]

    @property
    def peek_char(self) -> str:
        return (
            "\0" if self.pos + 1 >= len(self.src_code) else self.src_code[self.pos + 1]
        )

    def next_token(self) -> Token:
        # save current position
        cur_line = self.line
        cur_col = self.column

        match self.cur_char:
            case "\0":
                tok = Token(TokenType.EOF, "\0", cur_line, cur_col)
            case "\n":
                tok = Token(TokenType.NEWLINE, "\n", cur_line, cur_col)
                self.line += 1
                self.column = 0
            case "\r":
                if self.peek_char == "\n":
                    tok = Token(TokenType.NEWLINE, "\r\n", cur_line, cur_col)
                    self.read_char()
                else:
                    tok = Token(TokenType.NEWLINE, "\r", cur_line, cur_col)
                self.line += 1
                self.column = 0
            case " " | "\t":
                tok = Token(TokenType.WHITESPACE, self.cur_char, cur_line, cur_col)
            case "+":
                tok = Token(TokenType.PLUS, self.cur_char, cur_line, cur_col)
            case "-":
                tok = Token(TokenType.MINUS, self.cur_char, cur_line, cur_col)
            case "*":
                tok = Token(TokenType.ASTERISK, self.cur_char, cur_line, cur_col)
            case "/":
                tok = Token(TokenType.SLASH, self.cur_char, cur_line, cur_col)
            case "^":
                tok = Token(TokenType.CARET, self.cur_char, cur_line, cur_col)
            case "(":
                tok = Token(TokenType.LPAREN, self.cur_char, cur_line, cur_col)
            case ")":
                tok = Token(TokenType.RPAREN, self.cur_char, cur_line, cur_col)
            case _:
                tok = Token(TokenType.ILLEGAL, self.cur_char, cur_line, cur_col)
                raise IllegalCharErr(self.cur_char, cur_line, cur_col)

        self.read_char()
        return tok

    def __iter__(self) -> Iterator[Token]:
        while True:
            tok = self.next_token()
            yield tok
            if tok.type == TokenType.EOF:
                break
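
How a for loop with enumerate() drives a generator-based __iter__() can be seen in isolation. Countdown below is a hypothetical class unrelated to the lexer; it only borrows the same shape: a next_value() method that produces one item per call, and an __iter__() that yields items until a terminating value (here 0, playing the role of EOF) has been yielded.

```python
from typing import Iterator


class Countdown:
    """Hypothetical iterable: yields start, start-1, ..., 0 and then stops."""

    def __init__(self, start: int) -> None:
        self.current = start

    def next_value(self) -> int:
        # Return the current value and step towards 0 (never below it).
        value = self.current
        if self.current > 0:
            self.current -= 1
        return value

    def __iter__(self) -> Iterator[int]:
        while True:
            value = self.next_value()
            yield value     # hand the value to the for loop
            if value == 0:  # the terminator is yielded first, then we stop
                break


# enumerate() pairs each yielded value with its position in the sequence.
pairs = list(enumerate(Countdown(3)))
print(pairs)  # [(0, 3), (1, 2), (2, 1), (3, 0)]
```

Note that the terminating value is itself part of the output, just as the EOF token is the last token the lexer yields.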

Once all the tests pass, we are also ready to run next(). Before doing so, commit the code to the Git repository.

$ git add .
$ git commit -m "lexer: operator"