Lexical Analysis: PLUS and NEWLINE


2025-12-08

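Lexical Analysis

Lexical analysis converts a program, which is just a string, into tokens. Each token has a type and a value. TokenType defines every possible token type. Two of them are special: ILLEGAL marks a token the lexer does not recognize, and EOF marks the end of the input. When the parser reaches EOF, construction of the syntax tree should stop.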

Tip

We need to create two files: pyec/token.py and pyec/lexer.py.

token.py
- The TokenType class lists every possible token type
- The Token class holds a token's type and value
lexer.py
- The Lexer class implements the lexer

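First, we rework the REPL. Lines are collected until an empty line is entered; the REPL then creates a Lexer over the joined multi-line input, calls next_token() repeatedly, and prints each token until it reaches EOF.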

if line := input(
    PROMPT if not lines else PROMPT_CONTINUE
):  # read a line of input with prompt
    lines.append(line)
else:
    lexer = Lexer("\n".join(lines))
    while True:
        token = lexer.next_token()
        print(token)
        if token.type == TokenType.EOF:
            break
    lines.clear()

The `token.py` file

TokenType is simple: it is an enumeration. EOF marks the end of the input. We also define two extra token types, NEWLINE and WHITESPACE.

from enum import Enum


class TokenType(Enum):
    ILLEGAL = "ILLEGAL"
    EOF = "EOF"  # end of file
    NEWLINE = "NEWLINE"  # \n, \r, \r\n
    WHITESPACE = "WHITESPACE"  # space, tab
    PLUS = "PLUS"  # +

The Token class holds a token's type and value.

class Token:
    def __init__(self, token_type: TokenType, literal: str) -> None:
        self.type: TokenType = token_type
        self.literal: str = literal
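
The REPL output at the end of this article prints tokens as Token(type=..., literal=...), but the class above does not show how that text is produced. One possible way, shown here purely as an assumption and not as the original implementation, is a __repr__ method. The trimmed-down TokenType is only there to keep the sketch self-contained.

```python
from enum import Enum


class TokenType(Enum):
    ILLEGAL = "ILLEGAL"
    PLUS = "PLUS"  # +


class Token:
    def __init__(self, token_type: TokenType, literal: str) -> None:
        self.type: TokenType = token_type
        self.literal: str = literal

    def __repr__(self) -> str:
        # Assumed helper, not part of the class shown above.
        # !s gives "TokenType.PLUS"; !r quotes and escapes the literal.
        return f"Token(type={self.type!s}, literal={self.literal!r})"


print(Token(TokenType.PLUS, "+"))  # Token(type=TokenType.PLUS, literal='+')
```

Using !r for the literal means a newline is displayed as '\n' rather than as an actual line break, which keeps each token on one line.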

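The `lexer.py` file

Now let's look at the lexer itself, Lexer. Its __init__ method stores the source code string and initializes pos and read_pos to 0.

- src_code holds the source code string
- pos is the current position
- read_pos is the reading position, the index right after pos
- cur_char is the character at the current position

__init__ also calls read_char() to read the first character.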

class Lexer:
    def __init__(self, src_code: str) -> None:
        self.src_code: str = src_code  # source code
        self.pos: int = 0  # current position
        self.read_pos: int = 0  # reading position, position after current char
        self.cur_char: str = ""  # current char

        self.read_char()

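Each call to read_char() scans the next character: it loads the character at read_pos into cur_char, moves pos to read_pos, and advances read_pos by one. Because a lexer works this way, it is also called a scanner. peek_char() returns the next character without changing pos or read_pos.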

def read_char(self) -> None:
    """Read the next char and move the pos to the next char."""
    if self.read_pos >= len(self.src_code):
        self.cur_char = "\0"  # end of file
    else:
        self.cur_char = self.src_code[self.read_pos]

    self.pos = self.read_pos
    self.read_pos += 1

def peek_char(self) -> str:
    """Return the char without moving the pos."""
    if self.read_pos >= len(self.src_code):
        return "\0"  # end of file
    else:
        return self.src_code[self.read_pos]
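
To make the movement of pos and read_pos concrete, the following sketch records their values while scanning a two-character input. The Cursor class is a hypothetical stand-in that copies only the scanning part of Lexer, and the trace list is added for illustration.

```python
class Cursor:
    """Hypothetical stand-in for the scanning part of Lexer."""

    def __init__(self, src_code: str) -> None:
        self.src_code = src_code
        self.pos = 0
        self.read_pos = 0
        self.cur_char = ""
        self.read_char()  # load the first character

    def read_char(self) -> None:
        if self.read_pos >= len(self.src_code):
            self.cur_char = "\0"  # end of input
        else:
            self.cur_char = self.src_code[self.read_pos]
        self.pos = self.read_pos
        self.read_pos += 1

    def peek_char(self) -> str:
        if self.read_pos >= len(self.src_code):
            return "\0"  # end of input
        return self.src_code[self.read_pos]


cursor = Cursor("a+")
trace = []
while cursor.cur_char != "\0":
    # Record (pos, read_pos, current char, next char) before advancing.
    trace.append((cursor.pos, cursor.read_pos, cursor.cur_char, cursor.peek_char()))
    cursor.read_char()

for row in trace:
    print(row)
```

In every recorded row read_pos is exactly pos + 1, and peek_char() reports "\0" once the current character is the last one in the input.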

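next_token() is the most important method of the lexer. In the case "+" branch, scanning a + returns a token of type PLUS. NEWLINE needs more care: the line ending is \r\n on Windows and \n on Unix. So in the case "\r" branch, if \r is followed by \n, that is, peek_char() returns \n, we call read_char() to consume the \n as well and return a single NEWLINE token. A lone \r also yields a NEWLINE token. Any character the lexer does not recognize is classified as ILLEGAL.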

def next_token(self) -> Token:
    tok = None

    match self.cur_char:
        case "\0":
            tok = Token(TokenType.EOF, "")
        case "\n":
            tok = Token(TokenType.NEWLINE, "\n")
        case "\r":
            if self.peek_char() == "\n":
                tok = Token(TokenType.NEWLINE, "\r\n")
                self.read_char()
            else:
                tok = Token(TokenType.NEWLINE, "\r")
        case " " | "\t":
            tok = Token(TokenType.WHITESPACE, self.cur_char)
        case "+":
            tok = Token(TokenType.PLUS, self.cur_char)
        case _:
            tok = Token(TokenType.ILLEGAL, self.cur_char)

    self.read_char()
    return tok
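
To see the newline handling in action, here is a condensed, self-contained sketch of the pieces shown so far. Two things are assumptions for illustration: the tokenize() helper is hypothetical, and next_token() is written with if/elif instead of match so the sketch also runs on Python versions older than 3.10. The behavior is intended to mirror the code above.

```python
from enum import Enum


class TokenType(Enum):
    ILLEGAL = "ILLEGAL"
    EOF = "EOF"
    NEWLINE = "NEWLINE"
    WHITESPACE = "WHITESPACE"
    PLUS = "PLUS"


class Token:
    def __init__(self, token_type: TokenType, literal: str) -> None:
        self.type = token_type
        self.literal = literal


class Lexer:
    def __init__(self, src_code: str) -> None:
        self.src_code = src_code
        self.pos = 0
        self.read_pos = 0
        self.cur_char = ""
        self.read_char()

    def read_char(self) -> None:
        at_end = self.read_pos >= len(self.src_code)
        self.cur_char = "\0" if at_end else self.src_code[self.read_pos]
        self.pos = self.read_pos
        self.read_pos += 1

    def peek_char(self) -> str:
        at_end = self.read_pos >= len(self.src_code)
        return "\0" if at_end else self.src_code[self.read_pos]

    def next_token(self) -> Token:
        ch = self.cur_char
        if ch == "\0":
            tok = Token(TokenType.EOF, "")
        elif ch == "\n":
            tok = Token(TokenType.NEWLINE, "\n")
        elif ch == "\r":
            if self.peek_char() == "\n":
                self.read_char()  # consume the "\n" of a "\r\n" pair
                tok = Token(TokenType.NEWLINE, "\r\n")
            else:
                tok = Token(TokenType.NEWLINE, "\r")
        elif ch in (" ", "\t"):
            tok = Token(TokenType.WHITESPACE, ch)
        elif ch == "+":
            tok = Token(TokenType.PLUS, ch)
        else:
            tok = Token(TokenType.ILLEGAL, ch)
        self.read_char()  # step past the token just produced
        return tok


def tokenize(src_code: str) -> list:
    """Hypothetical helper: (type name, literal) pairs up to and including EOF."""
    lexer = Lexer(src_code)
    pairs = []
    while True:
        tok = lexer.next_token()
        pairs.append((tok.type.name, tok.literal))
        if tok.type == TokenType.EOF:
            return pairs


pairs = tokenize("+\r\n+\r+\n?")
for pair in pairs:
    print(pair)
```

The input mixes all three line endings. The \r\n pair collapses into one NEWLINE token whose literal is "\r\n", while the lone \r and the lone \n each produce their own NEWLINE token, and ? falls through to ILLEGAL.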

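Testing the REPL

Now let's look at the REPL output. Previously it echoed back whatever we typed; now the lexer scans each character and prints the resulting tokens. Digits are still treated as illegal characters here; handling them is left for later development.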

=>> 12 + 5
 >> 3
 >>
Token(type=TokenType.ILLEGAL, literal='1')
Token(type=TokenType.ILLEGAL, literal='2')
Token(type=TokenType.WHITESPACE, literal=' ')
Token(type=TokenType.PLUS, literal='+')
Token(type=TokenType.WHITESPACE, literal=' ')
Token(type=TokenType.ILLEGAL, literal='5')
Token(type=TokenType.NEWLINE, literal='\n')
Token(type=TokenType.ILLEGAL, literal='3')
Token(type=TokenType.EOF, literal='')