Parsing Text

  • Parsing transforms text that’s easy for people to read into objects that are easy for computers to work with.
  • A grammar defines the textual patterns that a parser recognizes.
  • Most parsers tokenize input text and then analyze the tokens.
  • Most parsers need to implement some form of precedence to prioritize different patterns.
  • Operators like addition and function call work just like user-defined functions.
  • Programs can overload built-in operators by defining specially-named methods that are recognized by the compiler or interpreter.

Terms defined: abstract syntax tree, concrete class, CSV, grammar, JSON, operator overloading, parser, token, tokenizer, YAML

We constructed objects to match patterns in Chapter 4, but an expression like "2023-*{pdf,txt}" is a lot easier to read and write than code like Lit("2023-", Any(Either([Lit("pdf"), Lit("txt")]))). If we want to use the former, we need a parser to convert those human-readable strings into machine-comprehensible objects.

Table 5.1 shows the grammar our parser will handle. When we are done, our parser should be able to recognize that 2023-*.{pdf,txt} means the literal 2023-, any characters, a literal ., and then either a literal pdf or a literal txt.

Table 5.1: Glob grammar.

Meaning                    Character
Any literal character c    c
Zero or more characters    *
Alternatives               {x,y}

Please Don’t Write Parsers

Languages that are comfortable for people to read and write are usually difficult for computers to understand and vice versa, so we need parsers to translate the former into the latter. However, the world doesn’t need more file formats: please use CSV, JSON, YAML, or something else that already has an acronym rather than inventing something of your own.

Tokenizing

Most parsers are written in two parts (Figure 5.1). The first stage groups characters into meaningful pieces of text called “tokens”, such as the digits making up a number or the letters making up a variable name. Our grammar’s tokens are the special characters ,, {, }, and *; any sequence of one or more other characters is a single multi-letter token. This classification determines the design of our tokenizer:

  1. If a character is not special, then append it to the current literal (if there is one) or start a new literal (if there isn’t).

  2. If a character is special, then close the existing literal (if there is one) and create a token for the special character. Note that the , character closes a literal but doesn’t produce a token.

Figure 5.1: Stages in parsing pipeline.

The result of tokenization is a flat list of tokens. The second stage of parsing assembles tokens to create an abstract syntax tree (AST) that represents the structure of what was parsed. We will re-use the classes defined in Chapter 4 for this purpose.
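
For example, here is a sketch (assuming the Tokenizer and Parser built in the rest of this chapter, plus the Chapter 4 matcher classes) of the two representations for the same glob:

# The tokenizer produces a flat list of token sub-lists...
assert Tokenizer().tok("ab*") == [["Lit", "ab"], ["Any"]]

# ...which the parser links into a tree of matcher objects.
assert Parser().parse("ab*") == Lit("ab", Any())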

Before we start writing our tokenizer, we have to decide whether to implement it as a set of functions or as one or more classes. Based on previous experience, we choose the latter: this tokenizer is simple enough that we’ll only need a handful of functions, but one capable of handling a language like Python would be much larger, and classes are a handy way to group related functions together.

The main method of our tokenizer looks like this:

    def tok(self, text):
        self._setup()
        for ch in text:
            if ch == "*":
                self._add("Any")
            elif ch == "{":
                self._add("EitherStart")
            elif ch == ",":
                self._add(None)
            elif ch == "}":
                self._add("EitherEnd")
            elif ch in CHARS:
                self.current += ch
            else:
                raise NotImplementedError(f"what is '{ch}'?")
        self._add(None)
        return self.result

This method calls self._setup() at the start so that the tokenizer can be re-used. It doesn’t call self._add() for regular characters; instead, it creates a Lit entry when it encounters a special character (i.e., when the current literal ends) and once more after all the input has been processed (to capture the last literal).

The method self._add adds the current thing to the list of tokens. As a special case, self._add(None) means “add the literal but nothing else” (Figure 5.2):

    def _add(self, thing):
        if len(self.current) > 0:
            self.result.append(["Lit", self.current])
            self.current = ""
        if thing is not None:
            self.result.append([thing])

Figure 5.2: Steps in tokenizing a string.
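
To make this concrete, here is what the finished tokenizer produces for a small input (a sketch, assuming the Tokenizer class assembled below); the comments note which character closed each literal:

assert Tokenizer().tok("a{b,c}") == [
    ["Lit", "a"],      # closed by "{"
    ["EitherStart"],
    ["Lit", "b"],      # closed by ",", which adds no token of its own
    ["Lit", "c"],      # closed by "}"
    ["EitherEnd"],
]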

Finally, we work backward to initialize the tokenizer when we construct it and to define the set of characters that make up literals:

import string

CHARS = set(string.ascii_letters + string.digits)

class Tokenizer:
    def __init__(self):
        self._setup()

    def _setup(self):
        self.result = []
        self.current = ""

A Simple Constant

The code fragment above defines CHARS to be a set containing ASCII letters and digits. We use a set for speed: if we used a list, Python would have to scan through it each time we wanted to check a character, but finding something in a set is much faster. Chapter 17 will explain why the word “ASCII” appears in the string library’s definition of characters, but using string.ascii_letters and string.digits greatly reduces the chances of us typing "abcdeghi…yz" rather than "abcdefghi…yz". (The fact that it took you a moment to spot the missing letter ‘f’ proves this point.)
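
To see the difference, here is a rough timing sketch using timeit (the exact numbers vary by machine, but the list scan slows down as the collection grows while the set lookup does not):

from timeit import timeit
import string

chars_set = set(string.ascii_letters + string.digits)
chars_list = list(string.ascii_letters + string.digits)

# "9" is the last element, so the list scan is worst-case.
print(timeit("'9' in chars_set", globals=globals()))
print(timeit("'9' in chars_list", globals=globals()))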

We can now write a few tests to check that the tokenizer is producing a list of lists in which each sub-list represents a single token:

def test_tok_empty_string():
    assert Tokenizer().tok("") == []


def test_tok_any_either():
    assert Tokenizer().tok("*{abc,def}") == [
        ["Any"],
        ["EitherStart"],
        ["Lit", "abc"],
        ["Lit", "def"],
        ["EitherEnd"],
    ]
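
We can also check that unexpected characters are rejected; this sketch assumes the tests are run with pytest:

import pytest

def test_tok_unknown_character():
    # "?" is neither a special character nor in CHARS.
    with pytest.raises(NotImplementedError):
        Tokenizer().tok("what?")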

Parsing

We now need to turn the list of tokens into a tree. Just as we used a class for tokenizing, we will create one for parsing and give it a _parse method to start things off. This method doesn’t do any conversion itself. Instead, it takes a token off the front of the list and figures out which method handles tokens of that kind:

    def _parse(self, tokens):
        if not tokens:
            return Null()

        front, back = tokens[0], tokens[1:]
        if front[0] == "Any": handler = self._parse_Any
        elif front[0] == "EitherStart": handler = self._parse_EitherStart
        elif front[0] == "Lit": handler = self._parse_Lit
        else:
            assert False, f"Unknown token type {front}"

        return handler(front[1:], back)
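
The tests shown later call a public method Parser().parse, which doesn’t appear above. Presumably it is a thin wrapper that tokenizes the text and hands the token list to _parse; a minimal sketch:

class Parser:
    def parse(self, text):
        # Tokenize first, then build the tree from the token list.
        tokens = Tokenizer().tok(text)
        return self._parse(tokens)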

The handlers for Any and Lit are straightforward:

    def _parse_Any(self, rest, back):
        return Any(self._parse(back))

    def _parse_Lit(self, rest, back):
        return Lit(rest[0], self._parse(back))
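
As a quick check that these handlers chain correctly (assuming the parse wrapper sketched above):

assert Parser().parse("*xyz") == Any(Lit("xyz"))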

Either is a little messier. We didn’t save the commas, so we pull the next two tokens and store them, after first checking that the list actually contains two Lit tokens followed by an EitherEnd marker:

    def _parse_EitherStart(self, rest, back):
        if (
            len(back) < 3
            or (back[0][0] != "Lit")
            or (back[1][0] != "Lit")
            or (back[2][0] != "EitherEnd")
        ):
            raise ValueError("badly-formatted Either")
        left = Lit(back[0][1])
        right = Lit(back[1][1])
        return Either([left, right], self._parse(back[3:]))

An alternative approach is to take tokens from the list until we see an EitherEnd marker:

    def _parse_EitherStart(self, rest, back):
        children = []
        while back and (back[0][0] == "Lit"):
            children.append(Lit(back[0][1]))
            back = back[1:]

        if not children:
            raise ValueError("empty Either")

        if not back or back[0][0] != "EitherEnd":
            raise ValueError("badly-formatted Either")

        return Either(children, self._parse(back[1:]))

This achieves the same thing in the two-token case but allows us to write alternatives with more options without changing the code (assuming you solved the “Multiple Alternatives” exercise in Chapter 4). Tests confirm that we’re on the right track:

def test_parse_either_two_lit():
    assert Parser().parse("{abc,def}") == Either(
        [Lit("abc"), Lit("def")]
    )
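
If you did solve that exercise, a further test (a sketch, assuming Either accepts a list of any number of alternatives and its __eq__ compares them) exercises the loop:

def test_parse_either_three_lit():
    assert Parser().parse("{a,b,c}") == Either(
        [Lit("a"), Lit("b"), Lit("c")]
    )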

These tests assume we can compare Match objects using ==, just as we would compare numbers or strings, so we add a __eq__ method to our classes:

class Match:
    def __init__(self, rest):
        self.rest = rest if rest else Null()

    def __eq__(self, other):
        return (other is not None and 
                self.__class__ == other.__class__ and 
                self.rest == other.rest)

class Lit(Match):
    def __init__(self, chars, rest=None):
        super().__init__(rest)
        self.chars = chars

    def __eq__(self, other):
        return super().__eq__(other) and (
            self.chars == other.chars
        )

Since we’re using inheritance to implement our matchers, we write the check for equality in two parts. The parent class Match performs the checks that all classes need to perform (in this case, that the objects being compared have the same concrete class and equal rest patterns). If a child class needs to do any more checking (for example, that the characters in two Lit objects are the same), it calls up to the parent method first, then adds its own tests.
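
A few spot checks of this two-part equality (assuming the classes above and the Null class from Chapter 4):

assert Lit("abc") == Lit("abc")   # same class, same characters
assert Lit("abc") != Lit("def")   # characters differ
assert Lit("abc") != Any()        # concrete classes differ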

They’re Just Methods

Operator overloading relies on the fact that when Python sees a == b it calls a.__eq__(b). Similarly, a + b is “just” a call to a.__add__(b), so if we create methods with the right names, we can manipulate objects using familiar operations.
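
For example, here is a sketch of a small (hypothetical) class that overloads + by defining __add__:

class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        # Invoked for `self + other`.
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

assert Vector(1, 2) + Vector(3, 4) == Vector(4, 6)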

Summary

Figure 5.3 summarizes the key ideas in this chapter. Once again, while it’s useful to understand how parsers work, please don’t create new data formats that need new parsers if you can possibly avoid it.

Figure 5.3: Parser concept map.

Exercises

Escape Characters

Modify the parser to handle escape characters, so that (for example) \* is interpreted as a literal ‘*’ character and \\ is interpreted as a literal backslash.

Character Sets

Modify the parser so that expressions like [xyz] are interpreted to mean “match any one of those three characters”. (Note that this is a shorthand for {x,y,z}.)

Negation

Modify the parser so that [!abc] is interpreted as “match anything except one of those three characters”.

Nested Lists

Write a function that accepts a string representing nested lists containing numbers and returns the actual list. For example, the input [1, [2, [3, 4], 5]] should produce the corresponding Python list.

Simple Arithmetic

Write a function that accepts a string consisting of numbers and the basic arithmetic operations +, -, *, and /, and produces a nested structure showing the operations in the correct order. For example, 1 + 2 * 3 should produce ["+", 1, ["*", 2, 3]].