Object Persistence

A persistence framework saves and restores objects.
Persistence must handle aliasing and circularity.
Users should be able to extend persistence to handle objects of their own types.
Software designs should be open for extension but closed for modification.

Terms defined: atomic value, list comprehension, Open-Closed Principle, persistence

Version control can keep track of our files, but what should we put in them? Plain text works well for things like this chapter, but the data structures used to represent HTML (Chapter 11) or the state of a game aren’t easy to represent in prose.

Another option is to store objects, i.e., to save a list of dictionaries as-is rather than flattering it into rows and columns. Python’s pickle module does this in a Python-specific way, while the json module saves some kinds of data as text formatted like JavaScript objects. As odd as it may seem, this has become a cross-language standard.

The phrase “some kinds of data” is the most important part of the preceding paragraph. Since programs can define new classes, a persistence framework has to choose one of the following:

Only handle built-in types, or even more strictly, only handle types that are common across many languages, so that data saved by Python can be read by JavaScript and vice versa.
Provide a way for programs to convert from user-defined types to built-in types and then save those. This option is less restrictive than the first but can lead to some information being lost. For example, if instances of a program’s User class are saved as dictionaries, the program that reads data may wind up with dictionaries instead of users.
Save class definitions as well as objects’ values so that when a program reads saved data it can reconstruct the classes and then create fully functional instances of them. This choice is the most powerful, but it is also the hardest to implement, particularly across languages. It is also the riskiest: if a program is running third-party code in order to restore objects, it has to trust that code not to do anything malicious.

This chapter starts by implementing the first option (built-in types only), then extends it to handle objects that the data structure refers to in several places (which JSON does not). To keep parsing and testing simple, our framework will store everything as text with one value per line; we will look at non-text options in Chapter 17, and at how to handle user-defined types in Appendix B.

Built-in Types

The first thing we need to do is specify our data format. We will store each atomic value on a line of its own with a type name and a value separated by a colon:

bool:True
int:123

Since we are storing things as text, we have to handle strings carefully: for example, we might need to save the string "str:something" and later be able to tell that it isn’t the string "something". We do this by splitting strings on newline characters and saving the number of lines, followed by the actual data:

# input
this is
two lines

# output
str:2
this is
two lines

The function save handles three of Python’s built-in types to start with:

def save(writer, thing):
    if isinstance(thing, bool):
        print(f"bool:{thing}", file=writer)

    elif isinstance(thing, float):
        print(f"float:{thing}", file=writer)

    elif isinstance(thing, int):
        print(f"int:{thing}", file=writer)

    else:
        raise ValueError(f"unknown type of thing {type(thing)}")

The function that loads data starts by reading a single line, stripping off the newline at the end (which is added automatically by the print statement in save), and then splitting the line on colons. After checking that there are two fields, it uses the type name in the first field to decide how to handle the second:

def load(reader):
    line = reader.readline()[:-1]
    assert line, "Nothing to read"
    fields = line.split(":", maxsplit=1)
    assert len(fields) == 2, f"Badly-formed line {line}"
    key, value = fields

    if key == "bool":
        names = {"True": True, "False": False}
        assert value in names, f"Unknown Boolean {value}"
        return names[value]

    elif key == "float":
        return float(value)

    elif key == "int":
        return int(value)

    else:
        raise ValueError(f"unknown type of thing {line}")

Saving a list is almost as easy: we save the number of items in the list, and then save each item with a recursive call to save. For example, the list [55, True, 2.71] is saved as shown in Figure 16.1. The code to do this is:

    elif isinstance(thing, list):
        print(f"list:{len(thing)}", file=writer)
        for item in thing:
            save(writer, item)

while to load a list, we just read the specified number of items:

    elif key == "list":
        return [load(reader) for _ in range(int(value))]

Saving lists — Figure 16.1: Saving nested data structures.

Notice that save and load don’t need to know what kinds of values are in the list. Each recursive call advances the input or output stream by precisely as many lines as it needs to. As a result, this approach should handle empty lists and nested lists without any extra work.

Our functions handle sets in exactly the same way as lists; the only difference is using the keyword set instead of the keyword list in the opening line. To save a dictionary, we save the number of entries and then save each key and value in turn:

    elif isinstance(thing, dict):
        print(f"dict:{len(thing)}", file=writer)
        for (key, value) in thing.items():
            save(writer, key)
            save(writer, value)

The code to load a dictionary is analogous. With this machinery in place, we can save our first data structure:

save(sys.stdout, [False, 3.14, "hello", {"left": 1, "right": [2, 3]}])

list:4
bool:False
float:3.14
str:1
hello
dict:2
str:1
left
int:1
str:1
right
list:2
int:2
int:3

We now need to write some unit tests. We will use two tricks when doing this:

The StringIO class from Python’s io module allows us to read from strings and write to them using the functions we normally use to read and write files. Using this lets us run our tests without creating lots of little files as a side effect.
The dedent function from Python’s textwrap module removes leading indentation from the body of a string. As the example below shows, dedent allows us to indent a fixture the same way we indent our Python code, which makes the test easier to read.

def test_save_list_flat():
    fixture = [0, False]
    expected = dedent("""\
    list:2
    int:0
    bool:False
    """)
    output = StringIO()
    save(output, fixture)
    assert output.getvalue() == expected

Converting to Classes

The save and load functions we built in the previous section work, but as we were extending them we had to modify their internals every time we wanted to do something new.

The Open-Closed Principle states that software should be open for extension but closed for modification, i.e., that it should be possible to extend functionality without having to rewrite existing code. This allows old code to use new code, but only if our design permits the kinds of extensions people are going to want to make. (Even then, it often leads to deep class hierarchies that can be hard for the next programmer to understand.) Since we can’t anticipate everything, it is normal to have to revise a design the first two or three times we try to extend it. As [Brand1995] said of buildings, the things we make learn how to do things better as we use them.

In this case, we can follow the Open-Closed Principle by rewriting our functions as classes and by using yet another form of dynamic dispatch to handle each item so that we don’t have to modify a multi-way if statement each time we add a new capability. If we have an object obj, then hasattr(obj, "name") tells us whether that object has an attribute called "name". If it does, getattr(obj, "name") returns that attribute’s value; if that attribute happens to be a method, we can then call it like a function:

class Example:
    def __init__(self, label):
        self.label = label

    def get_size(self):
        return len(self.label)

ex = Example("thing")
print("ex has missing", hasattr(ex, "missing"))
print("ex has label", hasattr(ex, "label"), "with value", getattr(ex, "label"))
print("ex has get_size", hasattr(ex, "get_size"))
method = getattr(ex, "get_size")
print("result of calling method", method())

ex has missing False
ex has label True with value thing
ex has get_size True
result of calling method 5

Using this, the core of our saving class is:

class SaveObjects:
    def __init__(self, writer):
        self.writer = writer

    def save(self, thing):
        typename = type(thing).__name__
        method = f"save_{typename}"
        assert hasattr(self, method), \
            f"Unknown object type {typename}"
        getattr(self, method)(thing)

We have called this class SaveObjects instead of just Save because we are going to create other variations on it. SaveObjects.save figures out which method to call to save a particular thing by constructing a name based on the thing’s type, checking whether that method exists, and then calling it. As in the previous example of dynamic dispatch, the methods that handle specific items must all have the same signature so that they can be called interchangeably. For example, the methods that write integers and strings are:

    def save_int(self, thing):
        self._write("int", thing)

    def save_str(self, thing):
        lines = thing.split("\n")
        self._write("str", len(lines))
        for line in lines:
            print(line, file=self.writer)

LoadObjects.load combines dynamic dispatch with the string handling of our original load function:

class LoadObjects:
    def __init__(self, reader):
        self.reader = reader

    def load(self):
        line = self.reader.readline()[:-1]
        assert line, "Nothing to read"
        fields = line.split(":", maxsplit=1)
        assert len(fields) == 2, f"Badly-formed line {line}"
        key, value = fields
        method = f"load_{key}"
        assert hasattr(self, method), f"Unknown object type {key}"
        return getattr(self, method)(value)

The methods that load individual items are even simpler. For example, we load a floating-point number like this:

    def load_float(self, value):
        return float(value)

Aliasing

Consider the two lines of code below, which created the data structure shown in Figure 16.2. If we save this structure and then reload it using what we have built so far, we will wind up with two copies of the list containing the string "content" instead of one. This won’t be a problem if we only ever read the reloaded data, but if we modify the new copy of fixture[0], we won’t see that change reflected in fixture[1], where we would have seen the change in the original data structure:

shared = ["content"]
fixture = [shared, shared]

Saving aliased data incorrectly — Figure 16.2: Saving aliased data without respecting aliases.

The problem is that the list shared is aliased, i.e., there are two or more references to it. To reconstruct the original data correctly, we need to:

keep track of everything we have saved;
save a marker instead of the object itself when we try to save it a second time; and
reverse this process when loading data.

We can keep track of the things we have saved using Python’s built-in id function, which returns a unique ID for every object in the program. For example, even if two lists contain exactly the same values, id will report different IDs for those lists because they’re stored in different locations in memory. We can use this to:

store the IDs of all the objects we’ve already saved in a set, and then
write a special entry with the keyword alias and its unique ID when we see an object for the second time.

Here’s the start of SaveAlias:

class SaveAlias(SaveObjects):
    def __init__(self, writer):
        super().__init__(writer)
        self.seen = set()

    def save(self, thing):
        thing_id = id(thing)
        if thing_id in self.seen:
            self._write("alias", thing_id, "")
            return

        self.seen.add(id(thing))
        typename = type(thing).__name__
        method = f"save_{typename}"
        assert hasattr(self, method), f"Unknown object type {typename}"
        getattr(self, method)(thing)

Its constructor creates an empty set of IDs seen so far. If SaveAlias.save notices that the object it’s about to save has been saved before, it writes a line like this:

alias:12345678:

where 12345678 is the object’s ID. (The exercises will ask why the trailing colon needs to be there.) If the object hasn’t been seen before, SaveAlias saves the object’s type, its ID, and either its value or its length:

    def save_list(self, thing):
        self._write("list", id(thing), len(thing))
        for item in thing:
            self.save(item)

SaveAlias._list is a little different from SaveObjects._list because it has to save each object’s identifier along with its type and its value or length. Our LoadAlias class needs a similar change compared to LoadObjects. The first version is shown below; as we will see, it contains a subtle bug:

class LoadAlias(LoadObjects):
    def __init__(self, reader):
        super().__init__(reader)
        self.seen = {}

    def load(self):
        line = self.reader.readline()[:-1]
        assert line, "Nothing to read"
        fields = line.split(":", maxsplit=2)
        assert len(fields) == 3, f"Badly-formed line {line}"
        key, ident, value = fields

        # the lines below contain a bug
        if key == "alias":
            assert ident in self.seen
            return self.seen[ident]

        method = f"load_{key}"
        assert hasattr(self, method), f"Unknown object type {key}"
        result = getattr(self, method)(value)
        self.seen[ident] = result
        return result

The first test of our new code is:

def test_aliasing_no_aliasing():
    fixture = ["a", {"b": True, 7: {"c": "d"}}]
    assert roundtrip(fixture) == fixture

which uses this helper function:

def roundtrip(fixture):
    writer = StringIO()
    SaveAlias(writer).save(fixture)
    reader = StringIO(writer.getvalue())
    return LoadAlias(reader).load()

There isn’t any aliasing in the test case, but that’s deliberate: we want to make sure we haven’t broken code that was working before we move on. Here’s a test that actually includes some aliasing:

def test_aliasing_shared_child():
    shared = ["content"]
    fixture = [shared, shared]
    result = roundtrip(fixture)
    assert result == fixture
    assert id(result[0]) == id(result[1])
    result[0][0] = "changed"
    assert result[1][0] == "changed"

It checks that the aliased sub-list is actually aliased after the data is restored, then checks that changes to the sub-list through one alias show up through the other. The second check ought to be redundant, but it’s still comforting.

There’s one more case to check, and unfortunately it reveals a bug. The two lines:

fixture = []
fixture.append(fixture)

create the data structure shown in Figure 16.3, in which an object contains a reference to itself. Our code ought to handle this case but doesn’t: when we try to read in the saved data, LoadAlias.load sees the alias line but then says it can’t find the object being referred to.

A circular data structure — Figure 16.3: A data structure that contains a reference to itself.

The problem is these lines in LoadAlias.load marked as containing a bug, in combination with these lines inherited from LoadObjects:

    def load_list(self, value):
        return [self.load() for _ in range(int(value))]

Let’s trace execution for the saved data:

list:4484025600:1
alias:4484025600:

The first line tells us that there’s a list whose ID is 4484025600 so we LoadObjects._list to load a list of one element.
LoadObjects._list called LoadAlias.load recursively to load that one element.
LoadAlias.load reads the second line of saved data, which tells it to re-use the data whose ID is 4484025600. But LoadObjects._list hasn’t created that list yet—it is still reading the elements—so LoadAlias.load hasn’t added the list to seen.

The solution is to reorder the operations, which unfortunately means writing new versions of all the methods defined in LoadObjects. The new implementation of _list is:

    def load_list(self, ident, length):
        result = []
        self.seen[ident] = result
        for _ in range(int(length)):
            result.append(self.load())
        return result

This method creates the list it’s going to return, adds that list to the seen dictionary immediately, and then loads list items recursively. We have to pass it the ID of the list to use as the key in seen, and we have to use a loop rather than a list comprehension, but the changes to save_set and save_dict follow exactly the same pattern.

word = "word"
child = [word, word]
parent = []
parent.append(parent)
parent.append(child)

saver = SaveAlias(sys.stdout)
saver.save(parent)

list:4539747200:2
alias:4539747200:
list:4539552960:2
str:4539552048:1
word
alias:4539552048:

Summary

Figure 16.4 summarizes the ideas introduced in this chapter, while Appendix B shows how to extend our framework to handle user-defined classes.

Concept map for persistence — Figure 16.4: Concepts for persistence.

Exercises

Dangling Colon

Why is there a colon at the end of the line alias:12345678: when we create an alias marker?

Versioning

We now have several versions of our data storage format. Early versions of our code can’t read the archives created by later ones, and later ones can’t read the archives created early on (which used two fields per line rather than three). This problem comes up all the time in long-lived libraries and applications, and the usual solution is to include some sort of version marker at the start of each archive to indicate what version of the software created it (and therefore how it should be read). Modify the code we have written so far to do this.

Strings

Modify the framework so that strings are stored using escape characters like \n instead of being split across several lines.

Who Calculates?

Why doesn’t LoadAlias.load calculate object IDs? Why does it use the IDs saved in the archive instead?

Using Globals

The lesson on unit testing introduced the function globals, which can be used to look up everything defined at the top level of a program.

Modify the persistence framework so that it looks for save_ and load_ functions using globals.
Why is this a bad idea?