Bonus Material

Each chapter in this book is designed to be teachable in one classroom hour. This appendix presents material that extends core ideas but would break that attention budget.

Using Function Attributes

This material extends Chapter 6.

Since functions are objects, they can have attributes. The built-in function dir (short for “directory”) returns a list of those attributes’ names:

def example():
    "Docstring for example."
    print("in example")

print(dir(example))
['__annotations__', '__builtins__', '__call__', '__class__', \
'__closure__', '__code__', '__defaults__', '__delattr__', \
'__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', \
'__get__', '__getattribute__', '__getstate__', '__globals__', \
'__gt__', '__hash__', '__init__', '__init_subclass__', \
'__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', \
'__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', \
'__repr__', '__setattr__', '__sizeof__', '__str__', \
'__subclasshook__']

Most programmers never need to use most of these, but __name__ holds the function’s original name and __doc__ holds its docstring:

print("docstring:", example.__doc__)
print("name:", example.__name__)
docstring: Docstring for example.
name: example

We can modify the test runner of Chapter 6 to use the function’s __name__ attribute in reports instead of the key in the globals dictionary:

def run_tests(prefix):
    for (name, func) in globals().items():
        if name.startswith(prefix):
            try:
                func()
                print(func.__name__, "passed")
            except AssertionError:
                print(func.__name__, "failed")
            except Exception:
                print(func.__name__, "had error")

run_tests("test_")
test_sign_negative passed
test_sign_positive passed
test_sign_zero failed
test_sign_error had error

More usefully, we can say that if a test function’s docstring contains the string "test:skip" then we should skip the test, while "test:fail" means we expect this test to fail. Let’s rewrite our tests to show this off:

TEST_FAIL = "test:fail"
TEST_SKIP = "test:skip"

def test_sign_negative():
    "test:skip"
    assert sign(-3) == -1

def test_sign_positive():
    assert sign(19) == 1

def test_sign_zero():
    "test:fail"
    assert sign(0) == 0

def test_sign_error():
    """Expect an error."""
    assert sgn(1) == 1

and then modify run_tests to look for these strings and act accordingly:

def run_tests(prefix):
    all_names = [n for n in globals() if n.startswith(prefix)]
    for name in all_names:
        func = globals()[name]
        try:
            if func.__doc__ and TEST_SKIP in func.__doc__:
                print(f"skip: {name}")
            else:
                func()
                print(f"pass: {name}")
        except AssertionError as e:
            if func.__doc__ and TEST_FAIL in func.__doc__:
                print(f"pass (expected failure): {name}")
            else:
                print(f"fail: {name} {str(e)}")
        except Exception as e:
            doc = f"/{func.__doc__}" if func.__doc__ else ""
            print(f"error: {name}{doc} {str(e)}")

run_tests("test_")

The output is now:

skip: test_sign_negative
pass: test_sign_positive
pass (expected failure): test_sign_zero
error: test_sign_error/Expect an error. name 'sgn' is not defined

Instead of (ab)using docstrings like this, we can instead add our own attributes to functions. Let’s say that if a function has an attribute called skip with the value True then the function is to be skipped, while if it has an attribute called fail whose value is True then the test is expected to fail. Our tests become:

def test_sign_negative():
    assert sign(-3) == -1
test_sign_negative.skip = True

def test_sign_positive():
    assert sign(19) == 1

def test_sign_zero():
    assert sign(0) == 0
test_sign_zero.fail = True

def test_sign_error():
    assert sgn(1) == 1

We can write a helper function called classify to classify tests. Note that it uses hasattr to check if an attribute is present before trying to get its value:

def classify(func):
    if hasattr(func, "skip") and func.skip:
        return "skip"
    if hasattr(func, "fail") and func.fail:
        return "fail"
    return "run"

Finally, our test runner becomes:

def run_tests(prefix):
    all_names = [n for n in globals() if n.startswith(prefix)]
    for name in all_names:
        func = globals()[name]
        kind = classify(func)
        try:
            if kind == "skip":
                print(f"skip: {name}")
            else:
                func()
                print(f"pass: {name}")
        except AssertionError as e:
            if kind == "fail":
                print(f"pass (expected failure): {name}")
            else:
                print(f"fail: {name} {str(e)}")
        except Exception as e:
            print(f"error: {name} {str(e)}")

run_tests("test_")

Lazy Evaluation

This material extends Chapter 7.

One way to evaluate a design is to ask how easy it is to extend. The answer for our interpreter is now, “Pretty easily.” For example, we can add a comment “operation” that does nothing and returns None simply by writing a do_comment function:

def do_comment(env, args):
    """Ignore instructions.
    ["comment" "text"] => None
    """
    return None

An if statement is a bit more complex. If its first argument is true, it evaluates and returns its second argument (the “if” branch). Otherwise, it evaluates and returns its third argument (the “else” branch):

def do_if(env, args):
    """Make a choice: only one sub-expression is evaluated.
    ["if" C A B] => A if C else B
    """
    assert len(args) == 3
    cond = do(env, args[0])
    choice = args[1] if cond else args[2]
    return do(env, choice)

As we said in Chapter 8, this is called lazy evaluation to distinguish it from the more usual eager evaluation that evaluates everything up front. do_if only evaluates what it absolutely needs to; most languages do this so that we can safely write things like:

if x != 0:
    return 1/x
else:
    return None

If the language always evaluated both branches, then the code shown above would fail whenever x was zero, even though it’s supposed to handle that case. In this case it might seem obvious what the language should do, but most languages use lazy evaluation for and and or as well so that expressions like:

thing and thing.part

will produce None if thing is None and thing.part if it isn’t.
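
We can check this short-circuit behavior directly (a small example, not from the chapter):

thing = None
print(thing and thing.part)  # prints None; thing.part is never evaluated

class Thing:
    part = "wheel"

thing = Thing()
print(thing and thing.part)  # prints wheel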

Extension

This material extends Chapter 13.

It’s easy to check a single style rule by extending NodeVisitor, but what if we want to check dozens of rules? Traversing the AST dozens of times would be inefficient. And what if we want people to be able to add their own rules? Inheritance is the wrong tool for this: if several people each create their own NodeVisitor with a visit_Name method, we’d have to inherit from all those classes and then have the new class’s visit_Name call up to all of its parents’ equivalent methods.

One way around this is to inject methods into classes after they have been defined. The code fragment below creates a new class called BlankNodeVisitor that doesn’t add anything to NodeVisitor, then uses setattr to add a method to it after it has been defined (Figure B.1):

import ast
import sys

class BlankNodeVisitor(ast.NodeVisitor):
    pass

def print_name(self, node):
    print(node.id)

setattr(BlankNodeVisitor, "visit_Name", print_name)
Figure B.1: Adding methods to classes after their definition.

This trick works because classes and objects are just specialized dictionaries (for some large value of “just”). If we create an object of BlankNodeVisitor and call its visit method:

with open(sys.argv[1], "r") as reader:
    source = reader.read()
tree = ast.parse(source)
finder = BlankNodeVisitor()
finder.visit(tree)

then the inherited generic_visit method does what it always does. When it encounters a Name node, it looks in the object for something called visit_Name. Since it doesn’t find anything, it looks in the object’s class for something with that name, finds our injected method, and calls it.

With a bit more work, we could have our injected method save and then call whatever visit_Name method was there when it was added to the class, but we would quickly run into a problem. As we’ve seen in earlier examples, the methods that handle nodes are responsible for deciding whether and when to recurse into those nodes’ children. If we pile methods on top of one another, then either each one is going to trigger recursion (so we recurse many times) or there will have to be some way for each one to signal whether it has done so, so that the other methods don’t.
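
A sketch of that save-and-call idea (an illustration, not code from the chapter) shows where the trouble lies:

def inject(cls, name, new_method):
    """Replace cls.name with a method that chains to the old one."""
    previous = getattr(cls, name, None)
    def wrapper(self, node):
        new_method(self, node)
        if previous is not None:
            previous(self, node)  # both methods may recurse into children
    setattr(cls, name, wrapper)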

To avoid this complication, most systems use a different approach. Consider this class:

import ast
from collections import namedtuple

Handler = namedtuple("Handler", ["func", "data"])

class RegisterNodeVisitor(ast.NodeVisitor):
    def __init__(self):
        super().__init__()
        self.handlers = {}

    def add_handler(self, nodeType, func, data=None):
        handler = Handler(func, data)
        if nodeType not in self.handlers:
            self.handlers[nodeType] = []
        self.handlers[nodeType].append(handler)

    def visit_Name(self, node):
        for handler in self.handlers.get(ast.Name, []):
            handler.func(node, handler.data)

The add_handler method takes three parameters: the type of node a callback function is meant to handle, the function itself, and an optional extra piece of data to pass to the function along with an AST node. It saves the handler function and the data in a lookup table indexed by the type of node the function is meant to handle. Each of the methods inherited from NodeVisitor then looks up handlers for its node type and runs them.

So what do handlers look like? Each one is a function that takes a node and some data as input and does whatever it’s supposed to do:

def count_names(node, counter):
    counter[node.id] += 1

Setting up the visitor is a bit more complicated, since we have to create and register the handler:

from collections import Counter

with open(sys.argv[1], "r") as reader:
    source = reader.read()
tree = ast.parse(source)

finder = RegisterNodeVisitor()
counter = Counter()
finder.add_handler(ast.Name, count_names, counter)

finder.visit(tree)
print(counter)

However, we can now register as many handlers as we want for each kind of node.
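
For example, we could register a second (hypothetical) handler that prints each name as well as counting it:

def show_name(node, _data):
    print(node.id)

finder.add_handler(ast.Name, show_name)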

Tracing Inheritance

This material extends Chapter 13.

In order to keep track of the code we wrote for this book, we built a tool that reports which methods are defined or redefined in which classes. To show how it works, we will use this file, which defines four classes, each of which defines or redefines some methods:

class Parent:
    def red(self):
        pass

    def green(self):
        pass


class LeftChild(Parent):
    def green(self):
        pass

    def blue(self):
        pass


class RightChild(Parent):
    def red(self):
        pass

    def blue(self):
        pass


class GrandChild(LeftChild):
    def red(self):
        pass

    def blue(self):
        pass

    def orange(self):
        pass

As in Chapter 13, our class’s constructor creates a stack to keep track of where we are. It also creates a couple of dictionaries to keep track of how classes inherit from each other and the methods each class defines:

import ast

class FindClassesAndMethods(ast.NodeVisitor):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.parents = {}
        self.methods = {}

When we encounter a new class definition, we push its name on the stack, record its parents, and create an empty set to hold its methods:

    def visit_ClassDef(self, node):
        class_name = node.name
        assert class_name not in self.methods
        self.stack.append(class_name)
        self.methods[class_name] = set()
        self.parents[class_name] = {p.id for p in node.bases}
        self.generic_visit(node)
        self.stack.pop()

When we encounter a function definition, the first thing we do is check the stack. If it’s empty, we’re looking at a top-level function rather than a method, so there’s nothing for us to do. (We should actually recurse through the function’s children, since it’s possible to define classes inside functions, but we’ll leave that as an exercise.) If this function definition is inside a class, on the other hand, we add its name to our records:

    def visit_FunctionDef(self, node):
        if not self.stack:
            return
        class_name = self.stack[-1]
        assert class_name in self.methods
        method_name = node.name
        assert method_name not in self.methods[class_name]
        self.methods[class_name].add(method_name)

Once we’re done searching the AST, we print out a table of the classes and methods we’ve seen (Table B.1). We could make this display easier to read—for example, we could sort the classes from parent to child and display methods in the order they were first defined—but none of that requires us to inspect the AST.

Table B.1: Inheritance and methods.

         GrandChild  LeftChild  Parent  RightChild
blue     X           X                  X
green                X          X
orange   X
red      X                      X       X
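
The printing code itself isn’t shown here, but a minimal sketch (assuming the finder has already visited the tree) might be:

def report(finder):
    classes = sorted(finder.methods.keys())
    all_methods = sorted(set.union(*finder.methods.values()))
    print("", *classes, sep="\t")
    for method in all_methods:
        marks = ["X" if method in finder.methods[c] else "" for c in classes]
        print(method, *marks, sep="\t")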

Inspecting Functions

This material extends Chapter 15.

The implementation of dataframe filtering in Chapter 15 was somewhat brittle. A better implementation of filtering would make use of the fact that Python’s inspect module lets us examine objects in memory. In particular, inspect.signature can tell us what parameters a function takes:

import inspect

def example(first, second):
    pass

sig = inspect.signature(example)
print("signature:", sig)
print("type:", type(sig))
print("names:", sig.parameters)
print("parameters:", list(sig.parameters.keys()))
signature: (first, second)
type: <class 'inspect.Signature'>
names: OrderedDict([('first', <Parameter "first">), ('second', \
<Parameter "second">)])
parameters: ['first', 'second']

If, for example, the user wants to compare the red and blue columns of a dataframe, they can give us a function that has two parameters called red and blue. We can then use those parameter names to figure out which columns we need from the dataframe.
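
A sketch of that idea, assuming (as in Chapter 15) that a dataframe’s columns are stored as a dictionary mapping names to equal-length lists, might look like this (filter_by_func is a hypothetical name):

import inspect

def filter_by_func(columns, func):
    """Keep the rows for which func returns True."""
    # Figure out which columns the user's function needs.
    wanted = list(inspect.signature(func).parameters.keys())
    num_rows = len(next(iter(columns.values())))
    result = {name: [] for name in columns}
    for i in range(num_rows):
        args = {name: columns[name][i] for name in wanted}
        if func(**args):
            for name in columns:
                result[name].append(columns[name][i])
    return result

print(filter_by_func({"red": [1, 5], "blue": [2, 3]},
                     lambda red, blue: red > blue))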

User-Defined Classes

This material extends Chapter 16.

The persistence framework of Chapter 16 only handles built-in data types, but it can easily be extended to handle user-defined classes as well. To start, we refactor the code so that the save method doesn’t get any larger:

class SaveExtend(SaveAlias):
    def __init__(self, writer):
        super().__init__(writer)

    def save(self, thing):
        if self._aliased(thing):
            return
        if self._builtin(thing):
            return
        if self._extension(thing):
            return
        assert False, f"Don't know how to handle {thing}"

The method to handle built-in types is:

    def _builtin(self, thing):
        typename = type(thing).__name__
        method = f"_{typename}"
        if not hasattr(self, method):
            return False
        self.seen.add(id(thing))
        getattr(self, method)(thing)
        return True

and the one that handles aliases is:

    def _aliased(self, thing):
        thing_id = id(thing)
        if thing_id not in self.seen:
            return False
        self._write("alias", thing_id, "")
        return True

None of this code is new: we’ve just moved things into methods to make each piece easier to understand.

So how does a class indicate that it can be saved and loaded by our framework? Our options are:

  1. Require it to inherit from a base class that we provide so that we can use isinstance to check if an object is persistable. This approach is used in statically typed languages like Java, but method #2 below is considered more Pythonic.

  2. Require it to implement a method with a specific name and signature without deriving from a particular base class. This approach is called duck typing: if it walks like a duck and quacks like a duck, it’s a duck. Since option #1 would require users to write this method anyway, option #2 is the one we’ll choose.

  3. Require users to register a helper class that knows how to save and load objects of the class we’re interested in. This approach is also commonly used in statically typed languages as a way of adding persistence after the fact without disrupting the class hierarchy.

To implement option #2, we specify that if a class has a method called to_dict, we will call that to get the object’s contents as a dictionary and then persist that dictionary. Before doing that, though, we will save a line indicating that this dictionary should be used to reconstruct an object of a particular class:

    def _extension(self, thing):
        if not hasattr(thing, "to_dict"):
            return False
        self._write("@extension", id(thing), thing.__class__.__name__)
        self.save(thing.to_dict())
        return True

Loading user-defined classes requires more work because we have to map class names back to actual classes. (We could also use introspection to find all the classes in the program and build a lookup table of the ones with the right method.) We start by modifying the loader’s constructor to take zero or more extension classes as arguments and then build a name-to-class lookup table from them:

class LoadExtend(LoadAlias):
    def __init__(self, reader, *extensions):
        super().__init__(reader)
        self.seen = {}
        self.extensions = {e.__name__: e for e in extensions}

The load method then looks for aliases, built-in types, and extensions in that order. Instead of using a chain of if statements we loop over the methods that handle these cases. If a method decides that it can handle the incoming data it returns a result; if it can’t, it raises a KeyError exception, and if none of the methods handle a case we fail:

    def load(self):
        key, ident, value = self._next()
        for method in (self._aliased, self._builtin, self._extension):
            try:
                return method(key, ident, value)
            except KeyError:
                pass
        assert False, f"Don't know how to handle {key} {ident} {value}"

The code to handle built-ins and aliases is copied from our previous work and modified to raise KeyError:

    def _aliased(self, key, ident, value):
        if key != "alias":
            raise KeyError()
        assert ident in self.seen
        return self.seen[ident]

    def _builtin(self, key, ident, value):
        method = f"_{key}"
        if not hasattr(self, method):
            raise KeyError()
        return getattr(self, method)(ident, value)

The method that handles extensions checks that the value on the line just read indicates an extension, then reads the dictionary containing the object’s contents from the input stream and uses it to build an instance of the right class:

    def _extension(self, key, ident, value):
        if (key != "@extension") or (value not in self.extensions):
            raise KeyError()
        cls = self.extensions[value]
        contents = self.load()
        return cls(**contents)

Here’s a class that defines the required method:

class Parent:
    def __init__(self, name):
        self.name = name

    def to_dict(self):
        return {"name": self.name}

and here’s a test to make sure everything works:

from io import StringIO

def test_extend_extension_class():
    fixture = Parent("subject")
    writer = StringIO()
    SaveExtend(writer).save(fixture)
    reader = StringIO(writer.getvalue())
    result = LoadExtend(reader, Parent).load()
    assert isinstance(result, Parent)
    assert result.name == fixture.name

What’s in a Name?

The first version of these classes used the word "extension" rather than "@extension". That led to the most confusing bug in this whole chapter. When load reads a line, it runs self._builtin before running self._extension. If the first word on the line is "extension" (without the @), then self._builtin constructs the method name _extension, finds that method, and calls it as if we were loading an object of a built-in type, which we’re not. Using @extension as the leading indicator means self._builtin checks for "_@extension" in the loader’s attributes, which doesn’t exist, so everything goes as it should.
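
The collision is easy to see in miniature:

key = "extension"
print(f"_{key}")   # _extension: the name of a real method, so _builtin grabs the line
key = "@extension"
print(f"_{key}")   # _@extension: never a legal method name, so hasattr returns False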

Floating Point Numbers

This material extends Chapter 17.

The rules for storing floating point numbers make those for Unicode look simple. The root of the problem is that we cannot represent an infinite number of real values with a finite set of bit patterns, and no matter which values we do represent, there will be an infinite number of values between them that we can’t. The explanation that follows is simplified to keep it manageable; please read [Goldberg1991] for more detail.

Floating point numbers are represented by a sign, a mantissa, and an exponent. In a 32-bit word the IEEE 754 standard calls for 1 bit of sign, 23 bits for the mantissa, and 8 bits for the exponent. We will illustrate how it works using a much smaller representation: no sign, 3 bits for the mantissa, and 2 for the exponent. Figure B.2 shows the values this scheme can represent.

Figure B.2: Representing floating point numbers.

The IEEE standard avoids the redundancy in this representation by shifting things around. Even with that, though, formats like this can’t represent a lot of values: for example, ours can store 8 and 10 but not 9. This is exactly like the problem hand calculators have with fractions like 1/3: in decimal, we have to round that to 0.3333 or 0.3334.

But if this scheme has no representation for 9 then \( 8+1 \) must be stored as either 8 or 10. What should \( 8+1+1 \) be? If we add from the left, \( (8+1)+1 \) is \( 8+1 \) is 8, but if we add from the right, \( 8+(1+1) \) is \( 8+2 \) is 10. Changing the order of operations makes the difference between right and wrong.
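
The same thing happens with real 64-bit floats, where the spacing between adjacent values near \( 10^{16} \) is 2 (a quick check, not from the chapter):

print((1e16 + 1) + 1 == 1e16)  # True: each 1 is rounded away
print(1e16 + (1 + 1) == 1e16)  # False: summing the small values first preserves them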

The authors of numerical libraries spend a lot of time worrying about things like this. In this case sorting the values and adding them from smallest to largest gives the best chance of getting the best possible answer. In other situations, like inverting a matrix, the rules are much more complicated.

Another observation about our number line is that while the values are unevenly spaced, the relative spacing between each set of values stays the same: the first group is separated by 1, then the separation becomes 2, then 4, and so on. This observation leads to a couple of useful definitions:

  • The absolute error in an approximation is the absolute value of the difference between the approximation and the actual value.

  • The relative error is the ratio of the absolute error to the absolute value we’re approximating.

For example, being off by 1 in approximating 8+1 and 56+1 is the same absolute error, but the relative error is larger in the first case than in the second. Relative error is almost always more useful than absolute: it makes little sense to say that we’re off by a hundredth when the value in question is a billionth.
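
In code (a small example for concreteness):

def absolute_error(approx, actual):
    return abs(approx - actual)

def relative_error(approx, actual):
    return abs(approx - actual) / abs(actual)

print(relative_error(8, 9))    # 0.111...
print(relative_error(56, 57))  # 0.0175...: same absolute error, smaller relative error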

One implication of this is that we should never compare floating point numbers with == or != because two numbers calculated in different ways will probably not have exactly the same bits. It’s safe to use <, >=, and other orderings, though, since they don’t depend on being the same down to the last bit.

If we do want to compare floating point numbers we can use something like the approx class from pytest which checks whether two numbers are within some tolerance of each other. A completely different approach is to use something like the fractions module, which (as its name suggests) uses numerators and denominators to avoid some precision issues. This post describes one clever use of the module.
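
For example (assuming pytest is installed):

from fractions import Fraction
from pytest import approx

print(0.1 + 0.2 == 0.3)          # False: the bit patterns differ
print(0.1 + 0.2 == approx(0.3))  # True: equal within tolerance
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True: exact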

Big and Little Endian

This material extends Chapter 17.

Suppose we want to store a 32-bit integer in memory. As Figure B.3 shows, we can order its four bytes in two different ways. Little-endian order stores the least significant bits of the integer at the first (lowest) address in memory, while big-endian order stores the most significant bits first.

Figure B.3: Big-endian and little-endian byte order.

Modern Intel processors use little-endian order, but as this article explains, some other processors (and most network protocols) use big-endian order. There are pros and cons to both, which we won’t go into here. What you do need to know is that if you move data from one architecture to another, it’s your responsibility to flip the bytes around, because the machine doesn’t know what the bytes mean. This is such a pain that the struct module and other libraries like it will do the conversion for you if you ask them to. If you’re using struct, the first character of a format string optionally indicates the byte order (Table B.2).

Table B.2: struct package endian indicators.

Character  Byte order     Size      Alignment
@          native         native    native
=          native         standard  none
<          little-endian  standard  none
>          big-endian     standard  none
!          network        standard  none
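
For example, packing the same 32-bit integer both ways shows the byte reversal:

import struct

value = 0x01020304
print(struct.pack("<I", value).hex())  # 04030201: least significant byte first
print(struct.pack(">I", value).hex())  # 01020304: most significant byte first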

Generating Test Cases

This material extends Chapter 20.

Theorem provers like Z3 and PicoSAT are far more powerful than most programmers realize. Borrowing an example from Andreas Zeller, we can use theorem provers to generate test cases. Suppose we have a function that classifies triangles as equilateral, scalene, or isosceles. We can set up some integer variables:

A = Int("A")
B = Int("B")
C = Int("C")
lengths = (A > 0, B > 0, C > 0)

and then ask it to create an equilateral triangle based solely on the definition:

equilateral = And(A == B, B == C, C == A)
solver = Solver()
solver.add(lengths)
solver.add(equilateral)
print("equilateral", solver.check(), solver.model())
equilateral sat [C = 1, B = 1, A = 1]

The same technique can generate a test case for scalene triangles:

scalene = And(A != B, B != C, C != A)
solver = Solver()
solver.add(lengths)
solver.add(scalene)
print("scalene", solver.check(), solver.model())
scalene sat [C = 3, A = 1, B = 2]

and isosceles triangles:

isosceles = Or(
    And(A == B, C != A),
    And(A == B, B != C),
    And(A != B, B == C)
)
solver = Solver()
solver.add(lengths)
solver.add(isosceles)
print("isosceles", solver.check(), solver.model())
isosceles sat [C = 2, A = 1, B = 2]
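
As a final check (not part of the original example), asking for a triangle that is both equilateral and scalene fails, which is a useful sanity test of our definitions:

solver = Solver()
solver.add(lengths)
solver.add(equilateral)
solver.add(scalene)
print("contradiction", solver.check())
contradiction unsat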