Matching Patterns
- Use globs and regular expressions to match patterns in text.
- Use inheritance to make matchers composable and extensible.
- Simplify code by having objects delegate work to other objects.
- Use the Null Object pattern to eliminate special cases in code.
- Use standard refactorings to move code from one working state to another.
- Build and check the parts of your code you are least sure of first to find out if your design will work.
Terms defined: Chain of Responsibility pattern, child class, Extract Parent Class refactoring, globbing, greedy matching, helper method, inheritance, lazy matching, literal (in parsing), Null Object pattern, refactor, regular expression, signature, test-driven development, technical debt
We used *.txt
to tell the duplicate file finder of Chapter 3 which files to compare.
Older programmers (like this author) refer to this kind of pattern-matching as globbing
because early versions of Unix had a tool called glob
to do it.
Globbing was so useful that it was quickly added to the shell,
and the Python standard library includes a module called glob
to match filenames in the same way.
For example,
2023-*.{pdf,txt}
matches 2023-01.txt
and 2023-final.pdf
but not draft-2023.docx
(Figure 4.1).
Globbing patterns are simpler than the regular expressions used to scrape data from text files, but the principles are the same. This chapter therefore implements a simple version of globbing to show how pattern-matching works in general. This matcher will only handle the cases in Table 4.1, but as the exercises will show, our design makes it easy to add new kinds of patterns.
Pattern | Text | Match? | Pattern | Text | Match? |
---|---|---|---|---|---|
abc |
“abc” | true | a*c |
“abc” | true |
ab |
“abc” | false | {a,b} |
“a” | true |
abc |
“ab” | false | {a,b} |
“c” | false |
* |
”“ | true | {a,b} |
“ab” | false |
* |
“abc” | true | *{x,y} |
“abcx” | true |
Simple Patterns
Matching is conceptually simple. If the first element of the pattern matches the target string at the current location, we check if the rest of the pattern matches what’s left of the string. If the element doesn’t match the front of the string, or if the rest of the pattern can’t match the rest of the string, matching fails. (This behavior makes globbing different from regular expressions, which can match parts of strings.)
This design makes use of the Chain of Responsibility design pattern. Each matcher matches if it can then asks the next matcher in the chain to try to match the remaining text (Figure 4.2). Crucially, objects don’t know how long the chain after them is: they just know whom to ask next.
In some cases we only need to know what kind of matching we’re doing:
for example, the *
pattern matches any characters.
In other cases, though,
we need some extra information,
such as the literal text "abc"
or the two alternatives "pdf"
and "txt"
.
We therefore decide to create matching objects that can hold this extra information
rather than just writing functions.
Our first matcher checks whether a piece of text like "abc"
matches a string.
We call this class Lit
because a fixed string of characters
is sometimes called a literal,
and it has a constructor and a match
method:
class Lit:
def __init__(self, chars, rest=None):
self.chars = chars
self.rest = rest
def match(self, text, start=0):
end = start + len(self.chars)
if text[start:end] != self.chars:
return False
if self.rest:
return self.rest.match(text, end)
return end == len(text)
chars
is the characters to be matched,
while rest
is responsible for matching the rest of the text.
If rest
is None
,
this matcher is the last one in the chain,
so it must match to the end of the target string.
The match
method takes the text to be matched as an input
along with an optional start
parameter
that indicates where matching is to start.
This parameter has a default value of 0
(meaning “start at the beginning”),
but if this Lit
follows other matchers,
they need to tell it where to start looking.
To see if this works,
let’s write and run a few tests:
def test_literal_match_entire_string():
# /abc/ matches "abc"
assert Lit("abc").match("abc")
def test_literal_substring_alone_no_match():
# /ab/ doesn't match "abc"
assert not Lit("ab").match("abc")
def test_literal_superstring_no_match():
# /abc/ doesn't match "ab"
assert not Lit("abc").match("ab")
Notice that we give tests long, meaningful names to make failure reports from the test runner easier to read.
We could go ahead and build some more matchers right away, but as [Petre2016] explains, good programmers build and check the parts of their code that they are least sure of as early as possible to find out if their entire design is going to work or not. We therefore write a test to make sure that chaining works when one literal matcher is followed by another:
def test_literal_followed_by_literal_match():
# /a/+/b/ matches "ab"
assert Lit("a", Lit("b")).match("ab")
def test_literal_followed_by_literal_no_match():
# /a/+/b/ doesn't match "ac"
assert not Lit("a", Lit("b")).match("ac")
Chaining two literal matchers together is unnecessary:
we could (and probably should) write Lit("ab")
instead of Lit("a", Lit("b"))
.
However,
the fact that these two tests pass reassures us that our design is working.
Test-Driven Development
Some programmers write the tests for a piece of code before writing the code itself. This practice is called test-driven development, and its advocates claim that it yields better code in less time because (a) writing tests helps people think about what the code should do before they’re committed to a particular implementation and (b) if people write tests first, they’ll actually write tests. However, research shows that the order in which the tests are written doesn’t actually make a difference [Fucci2016]; what actually matters is alternating short bursts of testing and coding.
These tests for Lit
pass,
so we’re ready to move on to wildcards.
A *
character in our pattern matches zero or more characters,
so if there are no more matchers in the chain,
then this *
matches to the end of the target string
and match
returns True
right away.
If there are other matchers,
on the other hand,
we try matching no characters, one character, two characters, and so on
and see if those other matchers can get us to the end of the string if we do so.
If none of these possibilities succeeds,
the overall match fails
(Figure 4.3).
class Any:
def __init__(self, rest=None):
self.rest = rest
def match(self, text, start=0):
if self.rest is None:
return True
for i in range(start, len(text)):
if self.rest.match(text, i):
return True
return False
Once again we write a few tests before moving on:
def test_any_matches_empty():
# /*/ matches ""
assert Any().match("")
def test_any_matches_entire_string():
# /*/ matches "abc"
assert Any().match("abc")
def test_any_matches_as_prefix():
# /*def/ matches "abcdef"
assert Any(Lit("def")).match("abcdef")
def test_any_matches_as_suffix():
# /abc*/ matches "abcdef"
assert Lit("abc", Any()).match("abcdef")
def test_any_matches_interior():
# /a*c/ matches "abc"
assert Lit("a", Any(Lit("c"))).match("abc")
Either/or matching works much the same way. If the first alternative matches, we try the rest of the chain. If not, we try the second alternative, and if that doesn’t work either, we fail:
class Either:
def __init__(self, left, right, rest=None):
self.left = left
self.right = right
self.rest = rest
def match(self, text, start=0):
return self.left.match(text, start) or \
self.right.match(text, start)
Our first few tests pass:
def test_either_two_literals_first():
# /{a,b}/ matches "a"
assert Either(Lit("a"), Lit("b")).match("a")
def test_either_two_literals_not_both():
# /{a,b}/ doesn't match "ab"
assert not Either(Lit("a"), Lit("b")).match("ab")
but further testing uncovers a bug:
def test_either_followed_by_literal_match():
# /{a,b}c/ matches "ac"
assert Either(Lit("a"), Lit("b"), Lit("c")).match("ac")
def test_either_followed_by_literal_no_match():
# /{a,b}c/ doesn't match "ax"
assert not Either(Lit("a"), Lit("b"), Lit("c")).match("ax")
======================= test session starts ========================
test_glob_problem.py F. [100%]
===================== short test summary info ======================
FAILED test_glob_problem.py::test_either_followed_by_literal_match
=================== 1 failed, 1 passed in 0.00s ====================
The problem is that Either.match
isn’t using rest
properly—in fact,
it’s not using rest
at all
because it doesn’t know what to pass it as a starting point.
Instead of having match
methods return True
or False
,
we need them to return an indication of where the next match should start
so that Either
can pass that information along to rest
.
Before making this change,
we will clear up a bit of technical debt in our code.
Rethinking
We now have three matchers with the same interfaces.
Before we do any further work,
we will refactor
using a pattern called Extract Parent Class
to make the relationship between the matchers clear (Figure 4.4).
At the same time,
each matcher is checking to see if its rest
is None
.
We can simplify this by creating a class to represent “nothing here”,
which is known as the Null Object pattern.
We Didn’t Invent This
We didn’t invent any of the patterns or refactorings used in this chapter. Instead, we learned them from books like [Gamma1994, Fowler2018, Kerievsky2004]. And as [Tichy2010] showed, learning these patterns makes people better programmers.
Our new parent class Match
looks like this:
class Match:
def __init__(self, rest):
self.rest = rest if rest is not None else Null()
def match(self, text):
result = self._match(text, 0)
return result == len(text)
Match.rest
requires every child class to have
a helper method called _match
that returns the location from which searching is to continue.
Match.match
checks whether the entire match reaches the end of the target string
and returns True
or False
as appropriate.
Our new Null Object class looks like this:
class Null(Match):
def __init__(self):
self.rest = None
def _match(self, text, start):
return start
Null
objects must be at the end of the matching chain,
i.e.,
their rest
must be None
,
so we remove the rest
parameter from the class’s constructor
and pass None
up to the parent constructor every time.
Since Null
objects don’t match anything,
Null._match
immediately returns whatever starting point it was given.
Every other matcher can now pass responsibility down the chain
without having to test whether it’s the last matcher in line or not.
With these changes in place, our literal matcher becomes:
class Lit(Match):
def __init__(self, chars, rest=None):
super().__init__(rest)
self.chars = chars
def _match(self, text, start):
end = start + len(self.chars)
if text[start:end] != self.chars:
return None
return self.rest._match(text, end)
Lit
’s constructor calls the constructor of its parent class
to initialize the things that all classes share,
then adds the data specific to this class.
It returns None
for “no match” or whatever self.rest
returns
If this object’s rest
is an instance of Null
,
this result will be the index after the overall match.
As before,
the matcher for *
checks what happens
if it matches an ever-larger part of the target string:
class Any(Match):
def __init__(self, rest=None):
super().__init__(rest)
def _match(self, text, start):
for i in range(start, len(text) + 1):
end = self.rest._match(text, i)
if end == len(text):
return end
return None
(The exercises will ask why loop has to run to len(text) + 1
.)
Finally,
the either/or matcher that prompted this refactoring becomes:
class Either(Match):
def __init__(self, left, right, rest=None):
super().__init__(rest)
self.left = left
self.right = right
def _match(self, text, start):
for pat in [self.left, self.right]:
end = pat._match(text, start)
if end is not None:
end = self.rest._match(text, end)
if end == len(text):
return end
return None
Looping over the left and right alternative saves us from repeating code or introducing a helper method. It also simplifies the handling of more than two options, which we explore in the exercises.
Crucially,
none of the existing tests change
because none of the matching classes’ constructors changed
and the signature of the match
method
(which they now inherit from the generic Match
class)
stayed the same as well.
We should add some tests for Null
,
but we have now met our original goal,
and as the exercises will show we can easily add matchers for other kinds of patterns.
Summary
Figure 4.5 summarizes the key ideas in this chapter; we will see the Null Object and Chain of Responsibility design patterns again.
Exercises
Looping
Rewrite the matchers so that a top-level object manages a list of matchers, none of which know about any of the others. Is this design simpler or more complicated than the Chain of Responsibility design?
Length Plus One
Why does the upper bound of the loop in the final version of Any
run to len(text) + 1
?
Find One or More
Extend the regular expression matcher to support +
,
meaning “match one or more characters”.
Match Sets of Characters
-
Add a new matching class that matches any character from a set, so that
Charset('aeiou')
matches any lower-case vowel. -
Create a matcher that matches a range of characters. For example,
Range("a", "z")
matches any single lower-case Latin alphabetic character. (This is just a convenience matcher: ranges can always be spelled out in full.) -
Write some tests for your matchers.
Exclusion
-
Create a matcher that doesn’t match a specified pattern. For example,
Not(Lit("abc"))
only succeeds if the text isn’t “abc”. -
Write some tests for it.
Make Repetition More Efficient
Rewrite Any
so that it does not repeatedly re-match text.
Multiple Alternatives
-
Modify
Either
so that it can match any number of sub-patterns, not just two. -
Write some tests for it.
-
What does your implementation do when no sub-patterns are specified?
Returning Matches
Modify the matcher so that it returns the substrings that matched
each part of the expression.
For example,
when *.txt
matches name.txt
,
the library should return some indication that *
matched the string "name"
.
Alternative Matching
The tool we have built implements lazy matching,
i.e.,
the *
character matches the shortest string it can
that results in the overall pattern matching.
Modify the code to do greedy matching instead,
and combine it with the solution to the previous exercise
for testing.