Matching Patterns

  • Use globs and regular expressions to match patterns in text.
  • Use inheritance to make matchers composable and extensible.
  • Simplify code by having objects delegate work to other objects.
  • Use the Null Object pattern to eliminate special cases in code.
  • Use standard refactorings to move code from one working state to another.
  • Build and check the parts of your code you are least sure of first to find out if your design will work.

Terms defined: Chain of Responsibility pattern, child class, Extract Parent Class refactoring, globbing, greedy matching, helper method, inheritance, lazy matching, literal (in parsing), Null Object pattern, refactor, regular expression, signature, test-driven development, technical debt

We used *.txt to tell the duplicate file finder of Chapter 3 which files to compare. Older programmers (like this author) refer to this kind of pattern-matching as globbing because early versions of Unix had a tool called glob to do it. Globbing was so useful that it was quickly added to the shell, and the Python standard library includes a module called glob to match filenames in the same way. For example, 2023-*.{pdf,txt} matches 2023-01.txt and 2023-final.pdf but not draft-2023.docx (Figure 4.1).

Matching examples
Figure 4.1: Examples of glob matching.

Globbing patterns are simpler than the regular expressions used to scrape data from text files, but the principles are the same. This chapter therefore implements a simple version of globbing to show how pattern-matching works in general. This matcher will only handle the cases in Table 4.1, but as the exercises will show, our design makes it easy to add new kinds of patterns.

Table 4.1: Pattern-matching cases.
Pattern Text Match? Pattern Text Match?
abc “abc” true a*c “abc” true
ab “abc” false {a,b} “a” true
abc “ab” false {a,b} “c” false
* ”“ true {a,b} “ab” false
* “abc” true *{x,y} “abcx” true

Simple Patterns

Matching is conceptually simple. If the first element of the pattern matches the target string at the current location, we check if the rest of the pattern matches what’s left of the string. If the element doesn’t match the front of the string, or if the rest of the pattern can’t match the rest of the string, matching fails. (This behavior makes globbing different from regular expressions, which can match parts of strings.)

This design makes use of the Chain of Responsibility design pattern. Each matcher matches if it can then asks the next matcher in the chain to try to match the remaining text (Figure 4.2). Crucially, objects don’t know how long the chain after them is: they just know whom to ask next.

Chain of Responsibility
Figure 4.2: Matching with Chain of Responsibility.

In some cases we only need to know what kind of matching we’re doing: for example, the * pattern matches any characters. In other cases, though, we need some extra information, such as the literal text "abc" or the two alternatives "pdf" and "txt". We therefore decide to create matching objects that can hold this extra information rather than just writing functions.

Our first matcher checks whether a piece of text like "abc" matches a string. We call this class Lit because a fixed string of characters is sometimes called a literal, and it has a constructor and a match method:

class Lit:
    def __init__(self, chars, rest=None):
        self.chars = chars
        self.rest = rest

    def match(self, text, start=0):
        end = start + len(self.chars)
        if text[start:end] != self.chars:
            return False
        if self.rest:
            return self.rest.match(text, end)
        return end == len(text)

chars is the characters to be matched, while rest is responsible for matching the rest of the text. If rest is None, this matcher is the last one in the chain, so it must match to the end of the target string.

The match method takes the text to be matched as an input along with an optional start parameter that indicates where matching is to start. This parameter has a default value of 0 (meaning “start at the beginning”), but if this Lit follows other matchers, they need to tell it where to start looking. To see if this works, let’s write and run a few tests:

def test_literal_match_entire_string():
    # /abc/ matches "abc"
    assert Lit("abc").match("abc")

def test_literal_substring_alone_no_match():
    # /ab/ doesn't match "abc"
    assert not Lit("ab").match("abc")

def test_literal_superstring_no_match():
    # /abc/ doesn't match "ab"
    assert not Lit("abc").match("ab")

Notice that we give tests long, meaningful names to make failure reports from the test runner easier to read.

We could go ahead and build some more matchers right away, but as [Petre2016] explains, good programmers build and check the parts of their code that they are least sure of as early as possible to find out if their entire design is going to work or not. We therefore write a test to make sure that chaining works when one literal matcher is followed by another:

def test_literal_followed_by_literal_match():
    # /a/+/b/ matches "ab"
    assert Lit("a", Lit("b")).match("ab")

def test_literal_followed_by_literal_no_match():
    # /a/+/b/ doesn't match "ac"
    assert not Lit("a", Lit("b")).match("ac")

Chaining two literal matchers together is unnecessary: we could (and probably should) write Lit("ab") instead of Lit("a", Lit("b")). However, the fact that these two tests pass reassures us that our design is working.

Test-Driven Development

Some programmers write the tests for a piece of code before writing the code itself. This practice is called test-driven development, and its advocates claim that it yields better code in less time because (a) writing tests helps people think about what the code should do before they’re committed to a particular implementation and (b) if people write tests first, they’ll actually write tests. However, research shows that the order in which the tests are written doesn’t actually make a difference [Fucci2016]; what actually matters is alternating short bursts of testing and coding.

These tests for Lit pass, so we’re ready to move on to wildcards. A * character in our pattern matches zero or more characters, so if there are no more matchers in the chain, then this * matches to the end of the target string and match returns True right away. If there are other matchers, on the other hand, we try matching no characters, one character, two characters, and so on and see if those other matchers can get us to the end of the string if we do so. If none of these possibilities succeeds, the overall match fails (Figure 4.3).

Wildcard matching
Figure 4.3: How wildcard matching works.
class Any:
    def __init__(self, rest=None):
        self.rest = rest

    def match(self, text, start=0):
        if self.rest is None:
            return True
        for i in range(start, len(text)):
            if self.rest.match(text, i):
                return True
        return False

Once again we write a few tests before moving on:

def test_any_matches_empty():
    # /*/ matches ""
    assert Any().match("")

def test_any_matches_entire_string():
    # /*/ matches "abc"
    assert Any().match("abc")

def test_any_matches_as_prefix():
    # /*def/ matches "abcdef"
    assert Any(Lit("def")).match("abcdef")

def test_any_matches_as_suffix():
    # /abc*/ matches "abcdef"
    assert Lit("abc", Any()).match("abcdef")

def test_any_matches_interior():
    # /a*c/ matches "abc"
    assert Lit("a", Any(Lit("c"))).match("abc")

Either/or matching works much the same way. If the first alternative matches, we try the rest of the chain. If not, we try the second alternative, and if that doesn’t work either, we fail:

class Either:
    def __init__(self, left, right, rest=None):
        self.left = left
        self.right = right
        self.rest = rest

    def match(self, text, start=0):
        return self.left.match(text, start) or \
            self.right.match(text, start)

Our first few tests pass:

def test_either_two_literals_first():
    # /{a,b}/ matches "a"
    assert Either(Lit("a"), Lit("b")).match("a")

def test_either_two_literals_not_both():
    # /{a,b}/ doesn't match "ab"
    assert not Either(Lit("a"), Lit("b")).match("ab")

but further testing uncovers a bug:

def test_either_followed_by_literal_match():
    # /{a,b}c/ matches "ac"
    assert Either(Lit("a"), Lit("b"), Lit("c")).match("ac")

def test_either_followed_by_literal_no_match():
    # /{a,b}c/ doesn't match "ax"
    assert not Either(Lit("a"), Lit("b"), Lit("c")).match("ax")
======================= test session starts ========================

test_glob_problem.py F.                                      [100%]

===================== short test summary info ======================
FAILED test_glob_problem.py::test_either_followed_by_literal_match
=================== 1 failed, 1 passed in 0.00s ====================

The problem is that Either.match isn’t using rest properly—in fact, it’s not using rest at all because it doesn’t know what to pass it as a starting point. Instead of having match methods return True or False, we need them to return an indication of where the next match should start so that Either can pass that information along to rest. Before making this change, we will clear up a bit of technical debt in our code.

Rethinking

We now have three matchers with the same interfaces. Before we do any further work, we will refactor using a pattern called Extract Parent Class to make the relationship between the matchers clear (Figure 4.4). At the same time, each matcher is checking to see if its rest is None. We can simplify this by creating a class to represent “nothing here”, which is known as the Null Object pattern.

Refactoring matchers
Figure 4.4: Using the Extract Parent Class refactoring.

We Didn’t Invent This

We didn’t invent any of the patterns or refactorings used in this chapter. Instead, we learned them from books like [Gamma1994, Fowler2018, Kerievsky2004]. And as [Tichy2010] showed, learning these patterns makes people better programmers.

Our new parent class Match looks like this:

class Match:
    def __init__(self, rest):
        self.rest = rest if rest is not None else Null()

    def match(self, text):
        result = self._match(text, 0)
        return result == len(text)

Match.rest requires every child class to have a helper method called _match that returns the location from which searching is to continue. Match.match checks whether the entire match reaches the end of the target string and returns True or False as appropriate.

Our new Null Object class looks like this:

class Null(Match):
    def __init__(self):
        self.rest = None

    def _match(self, text, start):
        return start

Null objects must be at the end of the matching chain, i.e., their rest must be None, so we remove the rest parameter from the class’s constructor and pass None up to the parent constructor every time. Since Null objects don’t match anything, Null._match immediately returns whatever starting point it was given. Every other matcher can now pass responsibility down the chain without having to test whether it’s the last matcher in line or not.

With these changes in place, our literal matcher becomes:

class Lit(Match):
    def __init__(self, chars, rest=None):
        super().__init__(rest)
        self.chars = chars

    def _match(self, text, start):
        end = start + len(self.chars)
        if text[start:end] != self.chars:
            return None
        return self.rest._match(text, end)

Lit’s constructor calls the constructor of its parent class to initialize the things that all classes share, then adds the data specific to this class. It returns None for “no match” or whatever self.rest returns If this object’s rest is an instance of Null, this result will be the index after the overall match.

As before, the matcher for * checks what happens if it matches an ever-larger part of the target string:

class Any(Match):
    def __init__(self, rest=None):
        super().__init__(rest)

    def _match(self, text, start):
        for i in range(start, len(text) + 1):
            end = self.rest._match(text, i)
            if end == len(text):
                return end
        return None

(The exercises will ask why loop has to run to len(text) + 1.) Finally, the either/or matcher that prompted this refactoring becomes:

class Either(Match):
    def __init__(self, left, right, rest=None):
        super().__init__(rest)
        self.left = left
        self.right = right

    def _match(self, text, start):
        for pat in [self.left, self.right]:
            end = pat._match(text, start)
            if end is not None:
                end = self.rest._match(text, end)
                if end == len(text):
                    return end
        return None

Looping over the left and right alternative saves us from repeating code or introducing a helper method. It also simplifies the handling of more than two options, which we explore in the exercises.

Crucially, none of the existing tests change because none of the matching classes’ constructors changed and the signature of the match method (which they now inherit from the generic Match class) stayed the same as well. We should add some tests for Null, but we have now met our original goal, and as the exercises will show we can easily add matchers for other kinds of patterns.

Summary

Figure 4.5 summarizes the key ideas in this chapter; we will see the Null Object and Chain of Responsibility design patterns again.

Concept map for regular expression matching
Figure 4.5: Regular expression matching concept map.

Exercises

Looping

Rewrite the matchers so that a top-level object manages a list of matchers, none of which know about any of the others. Is this design simpler or more complicated than the Chain of Responsibility design?

Length Plus One

Why does the upper bound of the loop in the final version of Any run to len(text) + 1?

Find One or More

Extend the regular expression matcher to support +, meaning “match one or more characters”.

Match Sets of Characters

  1. Add a new matching class that matches any character from a set, so that Charset('aeiou') matches any lower-case vowel.

  2. Create a matcher that matches a range of characters. For example, Range("a", "z") matches any single lower-case Latin alphabetic character. (This is just a convenience matcher: ranges can always be spelled out in full.)

  3. Write some tests for your matchers.

Exclusion

  1. Create a matcher that doesn’t match a specified pattern. For example, Not(Lit("abc")) only succeeds if the text isn’t “abc”.

  2. Write some tests for it.

Make Repetition More Efficient

Rewrite Any so that it does not repeatedly re-match text.

Multiple Alternatives

  1. Modify Either so that it can match any number of sub-patterns, not just two.

  2. Write some tests for it.

  3. What does your implementation do when no sub-patterns are specified?

Returning Matches

Modify the matcher so that it returns the substrings that matched each part of the expression. For example, when *.txt matches name.txt, the library should return some indication that * matched the string "name".

Alternative Matching

The tool we have built implements lazy matching, i.e., the * character matches the shortest string it can that results in the overall pattern matching. Modify the code to do greedy matching instead, and combine it with the solution to the previous exercise for testing.