
Implementing a parser using send() and return

Note

Title: Example using send() and return-with-value to implement a parser
Link: https://www.csse.canterbury.ac.nz/greg.ewing/python/yield-from/yf_current/Examples/Parser/parser.txt
Author: Gregory Ewing

PEP 380 proposes new syntax for delegating to a subgenerator, yield from <expr>, which allows a generator to delegate part of its operations to another generator. This article is one of the examples from the PEP's supporting material.

I'm going to develop two versions of a simple XML parser. The first version will use current Python facilities to do it in "pull" mode, where the scanner is a generator and the parser is a set of recursive functions that pull tokens out of it as needed.

The second version will turn it around so that it operates in "push" mode, with the scanner being an outer loop and the parser being a generator that gets tokens pushed into it using send(). Using yield-from, it will be seen that the parser code is almost identical to the first version, the main difference being that some function calls are replaced by yield-from statements.

Version 1 - Pull-mode, recursive functions

First, let's write a scanner and set it to work on some test data.

import re
pat = re.compile(r"(\S+)|(<[^>]*>)")

def scanner(text):
    for m in pat.finditer(text):
        token = m.group(0)
        print "Feeding:", repr(token)
        yield token
    yield None # to signal EOF

text = "<foo> This is a <b> foo file </b> you know. </foo>"
token_stream = scanner(text)

I'm using a global variable 'token_stream' to hold the source of tokens here. It could just as well be passed around as a parameter to the parsing functions, but doing it this way will bring out the symmetry with the other version better later on.
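
As a quick sanity check, the token pattern can be exercised on its own. This sketch uses Python 3 print syntax, unlike the article's Python 2 listings:

```python
import re

# Same token pattern as the article: either a run of non-space
# characters, or anything enclosed in angle brackets.
pat = re.compile(r"(\S+)|(<[^>]*>)")

text = "<foo> This is a <b> foo file </b> you know. </foo>"
tokens = [m.group(0) for m in pat.finditer(text)]
print(tokens)
```

For this sample text the tags contain no spaces, so each tag is matched whole by either alternative, and the token list is exactly the sequence fed to the parser below.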

Now let's write a function to parse a sequence of items, each of which is either a plain word or a compound element surrounded by opening and closing tags. We will return the result as a suitably-structured parse tree.

def parse_items(closing_tag = None):
    elems = []
    while 1:
        token = token_stream.next()
        if not token:
            break # EOF
        if is_opening_tag(token):
            elems.append(parse_elem(token))
        elif token == closing_tag:
            break
        else:
            elems.append(token)
    return elems

This makes use of a small utility function to see if a token is an opening tag:

def is_opening_tag(token):
    return token.startswith("<") and not token.startswith("</")

It also mutually recurses with the following function that parses a compound element. It gets passed the opening tag and eats up all tokens until the corresponding closing tag. It returns a parse tree for the compound element.

def parse_elem(opening_tag):
    name = opening_tag[1:-1]
    closing_tag = "</%s>" % name
    items = parse_items(closing_tag)
    return (name, items)

Now we can call the parser and print the result.

tree = parse_items()
print tree

This is all currently-valid Python, and when run it produces the following output.

Feeding: '<foo>'
Feeding: 'This'
Feeding: 'is'
Feeding: 'a'
Feeding: '<b>'
Feeding: 'foo'
Feeding: 'file'
Feeding: '</b>'
Feeding: 'you'
Feeding: 'know.'
Feeding: '</foo>'
[('foo', ['This', 'is', 'a', ('b', ['foo', 'file']), 'you', 'know.'])]

Version 2 - Push-mode, recursive generators

For this version, we turn things around so that the scanner is on the outside. The top-level code will now be like this:

def run():
    parser = parse_items()
    try:
        for m in pat.finditer(text):
            token = m.group(0)
            print "Feeding:", repr(token)
            parser.send(token)
        parser.send(None) # to signal EOF
    except StopIteration, e:
        tree = e.args[0]
        print tree

The parser is going to be a generator that returns the parse tree when it exits. To the outside, this manifests as a StopIteration exception with an argument, which we extract.

(We're assuming that there are no syntax errors in the input, so the parser is ready to exit by the time the input is exhausted. A more robust implementation would take care of the case that it isn't.)
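
The return-value mechanism described here is standard generator behaviour in Python 3.3+: a generator's return value rides on the StopIteration it raises. A minimal illustration:

```python
def gen():
    yield "a token"      # the send/yield channel
    return "the result"  # the return channel

g = gen()
next(g)                  # run to the yield
try:
    next(g)              # generator finishes here
except StopIteration as e:
    # The return value is available as e.value
    # (equivalently e.args[0], as used in run() above).
    print(e.value)
```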

Now the only thing we need to do to our two parsing functions parse_items and parse_elem is to replace any calls to get a token with 'yield', and insert 'yield from' into each of the recursive calls. The changes are marked with arrows below.

def parse_elem(opening_tag):
    name = opening_tag[1:-1]
    closing_tag = "</%s>" % name
    items = yield from parse_items(closing_tag) # <---
    return (name, items)

def parse_items(closing_tag = None):
    elems = []
    while 1:
        token = yield # <---
        if not token:
            break # EOF
        if is_opening_tag(token):
            elems.append(yield from parse_elem(token)) # <---
        elif token == closing_tag:
            break
        else:
            elems.append(token)
    return elems

That's all -- everything else remains the same.

I can't run this, because yield-from doesn't exist, but if I could, the output would be the same as before.

I hope this helps to give an idea of what I'm talking about with generators-as-threads. The whole parser can be seen as a lightweight thread, implemented as a collection of mutually recursive functions. Whenever any of the functions needs to get a token, it yields. To return a value to the calling function, it uses a return statement just like any other function.

I hope it is also clear why returning values via yield, or having 'yield from' return any of the yielded values, would be the wrong thing to do. The send/yield channel and the return channel are being used for completely different purposes, and conflating them would be disastrous.
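
The separation of the two channels can be seen in a small runnable sketch (Python 3.3+; the function names are illustrative, not from the article): values sent in from outside arrive at 'yield', while the eventual result comes back through 'return', and yield-from keeps the two distinct.

```python
def inner():
    token = yield          # send/yield channel: receives a value from outside
    return token.upper()   # return channel: delivers the final result

def outer():
    result = yield from inner()  # result is inner's *return* value,
    return result                # not anything that was yielded

g = outer()
next(g)                # prime: run to the yield inside inner
try:
    g.send("hello")    # travels through outer into inner's yield
except StopIteration as e:
    print(e.value)     # inner's return value, relayed by outer
```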


Note

The complete code for version 1:

# python 2.7

import re
pat = re.compile(r"(\S+)|(<[^>]*>)")

def is_opening_tag(token):
    return token.startswith("<") and not token.startswith("</")

def parse_elem(opening_tag):
    name = opening_tag[1:-1]
    closing_tag = "</%s>" % name
    items = parse_items(closing_tag)
    return (name, items)

def parse_items(closing_tag = None):
    elems = []
    while 1:
        token = token_stream.next()
        if not token:
            break # EOF
        if is_opening_tag(token):
            elems.append(parse_elem(token))
        elif token == closing_tag:
            break
        else:
            elems.append(token)
    return elems

def scanner(text):
    for m in pat.finditer(text):
        token = m.group(0)
        print "Feeding:", repr(token)
        yield token

    yield None # to signal EOF

text = "<foo> This is a <b> foo file </b> you know. </foo>"
token_stream = scanner(text)
tree = parse_items()
print tree

The complete code for version 2:

# python 3.3+

import re

pat = re.compile(r"(\S+)|(<[^>]*>)")

def is_opening_tag(token):
    return token.startswith("<") and not token.startswith("</")

def parse_elem(opening_tag):
    name = opening_tag[1:-1]
    closing_tag = "</%s>" % name
    items = yield from parse_items(closing_tag)
    return (name, items)

def parse_items(closing_tag = None):
    elems = []

    while 1:
        token = yield
        if not token:
            break # EOF
        if is_opening_tag(token):
            elem = yield from parse_elem(token)
            elems.append(elem)
        elif token == closing_tag:
            break
        else:
            elems.append(token)
    return elems

def run():
    parser = parse_items()
    # prime the generator
    parser.send(None)
    try:
        for m in pat.finditer(text):
            token = m.group(0)
            print("Feeding:", repr(token))
            parser.send(token)
        # finish the generator
        parser.send(None) # to signal EOF
    except StopIteration as e:
        tree = e.args[0]
        print(tree)

text = "<foo> This is a <b> foo file </b> you know. </foo>"
run()