Object Persistence
- A persistence framework saves and restores objects.
- Persistence must handle aliasing and circularity.
- Users should be able to extend persistence to handle objects of their own types.
- Software designs should be open for extension but closed for modification.
Terms defined: atomic value, list comprehension, Open-Closed Principle, persistence
Version control can keep track of our files, but what should we put in them? Plain text works well for things like this chapter, but the data structures used to represent HTML (Chapter 11) or the state of a game aren’t easy to represent in prose.
Another option is to store objects,
i.e.,
to save a list of dictionaries as-is
rather than flattering it into rows and columns.
Python’s pickle
module does this in a Python-specific way,
while the json
module saves some kinds of data as text formatted like JavaScript objects.
As odd as it may seem,
this has become a cross-language standard.
The phrase “some kinds of data” is the most important part of the preceding paragraph. Since programs can define new classes, a persistence framework has to choose one of the following:
-
Only handle built-in types, or even more strictly, only handle types that are common across many languages, so that data saved by Python can be read by JavaScript and vice versa.
-
Provide a way for programs to convert from user-defined types to built-in types and then save those. This option is less restrictive than the first but can lead to some information being lost. For example, if instances of a program’s
User
class are saved as dictionaries, the program that reads data may wind up with dictionaries instead of users. -
Save class definitions as well as objects’ values so that when a program reads saved data it can reconstruct the classes and then create fully functional instances of them. This choice is the most powerful, but it is also the hardest to implement, particularly across languages. It is also the riskiest: if a program is running third-party code in order to restore objects, it has to trust that code not to do anything malicious.
This chapter starts by implementing the first option (built-in types only), then extends it to handle objects that the data structure refers to in several places (which JSON does not). To keep parsing and testing simple, our framework will store everything as text with one value per line; we will look at non-text options in Chapter 17, and at how to handle user-defined types in Appendix B.
Built-in Types
The first thing we need to do is specify our data format. We will store each atomic value on a line of its own with a type name and a value separated by a colon:
bool:True
int:123
Since we are storing things as text,
we have to handle strings carefully:
for example,
we might need to save the string "str:something"
and later be able to tell that it isn’t the string "something"
.
We do this by splitting strings on newline characters
and saving the number of lines,
followed by the actual data:
# input
this is
two lines
# output
str:2
this is
two lines
The function save
handles three of Python’s built-in types to start with:
def save(writer, thing):
if isinstance(thing, bool):
print(f"bool:{thing}", file=writer)
elif isinstance(thing, float):
print(f"float:{thing}", file=writer)
elif isinstance(thing, int):
print(f"int:{thing}", file=writer)
else:
raise ValueError(f"unknown type of thing {type(thing)}")
The function that loads data starts by reading a single line,
stripping off the newline at the end
(which is added automatically by the print
statement in save
),
and then splitting the line on colons.
After checking that there are two fields,
it uses the type name in the first field
to decide how to handle the second:
def load(reader):
line = reader.readline()[:-1]
assert line, "Nothing to read"
fields = line.split(":", maxsplit=1)
assert len(fields) == 2, f"Badly-formed line {line}"
key, value = fields
if key == "bool":
names = {"True": True, "False": False}
assert value in names, f"Unknown Boolean {value}"
return names[value]
elif key == "float":
return float(value)
elif key == "int":
return int(value)
else:
raise ValueError(f"unknown type of thing {line}")
Saving a list is almost as easy:
we save the number of items in the list,
and then save each item
with a recursive call to save
.
For example,
the list [55, True, 2.71]
is saved as shown in Figure 16.1.
The code to do this is:
elif isinstance(thing, list):
print(f"list:{len(thing)}", file=writer)
for item in thing:
save(writer, item)
while to load a list, we just read the specified number of items:
elif key == "list":
return [load(reader) for _ in range(int(value))]
Notice that save
and load
don’t need to know
what kinds of values are in the list.
Each recursive call advances the input or output stream
by precisely as many lines as it needs to.
As a result,
this approach should handle empty lists and nested lists without any extra work.
Our functions handle sets in exactly the same way as lists;
the only difference is using the keyword set
instead of the keyword list
in the opening line.
To save a dictionary,
we save the number of entries
and then save each key and value in turn:
elif isinstance(thing, dict):
print(f"dict:{len(thing)}", file=writer)
for (key, value) in thing.items():
save(writer, key)
save(writer, value)
The code to load a dictionary is analogous. With this machinery in place, we can save our first data structure:
save(sys.stdout, [False, 3.14, "hello", {"left": 1, "right": [2, 3]}])
list:4
bool:False
float:3.14
str:1
hello
dict:2
str:1
left
int:1
str:1
right
list:2
int:2
int:3
We now need to write some unit tests. We will use two tricks when doing this:
-
The
StringIO
class from Python’sio
module allows us to read from strings and write to them using the functions we normally use to read and write files. Using this lets us run our tests without creating lots of little files as a side effect. -
The
dedent
function from Python’stextwrap
module removes leading indentation from the body of a string. As the example below shows,dedent
allows us to indent a fixture the same way we indent our Python code, which makes the test easier to read.
def test_save_list_flat():
fixture = [0, False]
expected = dedent("""\
list:2
int:0
bool:False
""")
output = StringIO()
save(output, fixture)
assert output.getvalue() == expected
Converting to Classes
The save
and load
functions we built in the previous section work,
but as we were extending them
we had to modify their internals
every time we wanted to do something new.
The Open-Closed Principle states that software should be open for extension but closed for modification, i.e., that it should be possible to extend functionality without having to rewrite existing code. This allows old code to use new code, but only if our design permits the kinds of extensions people are going to want to make. (Even then, it often leads to deep class hierarchies that can be hard for the next programmer to understand.) Since we can’t anticipate everything, it is normal to have to revise a design the first two or three times we try to extend it. As [Brand1995] said of buildings, the things we make learn how to do things better as we use them.
In this case,
we can follow the Open-Closed Principle by rewriting our functions as classes
and by using yet another form of dynamic dispatch to handle each item
so that we don’t have to modify a multi-way if
statement
each time we add a new capability.
If we have an object obj
,
then hasattr(obj, "name")
tells us whether that object has an attribute called "name"
.
If it does,
getattr(obj, "name")
returns that attribute’s value;
if that attribute happens to be a method,
we can then call it like a function:
class Example:
def __init__(self, label):
self.label = label
def get_size(self):
return len(self.label)
ex = Example("thing")
print("ex has missing", hasattr(ex, "missing"))
print("ex has label", hasattr(ex, "label"), "with value", getattr(ex, "label"))
print("ex has get_size", hasattr(ex, "get_size"))
method = getattr(ex, "get_size")
print("result of calling method", method())
ex has missing False
ex has label True with value thing
ex has get_size True
result of calling method 5
Using this, the core of our saving class is:
class SaveObjects:
def __init__(self, writer):
self.writer = writer
def save(self, thing):
typename = type(thing).__name__
method = f"save_{typename}"
assert hasattr(self, method), \
f"Unknown object type {typename}"
getattr(self, method)(thing)
We have called this class SaveObjects
instead of just Save
because we are going to create other variations on it.
SaveObjects.save
figures out which method to call
to save a particular thing
by constructing a name based on the thing’s type,
checking whether that method exists,
and then calling it.
As in the previous example of dynamic dispatch,
the methods that handle specific items
must all have the same signature
so that they can be called interchangeably.
For example,
the methods that write integers and strings are:
def save_int(self, thing):
self._write("int", thing)
def save_str(self, thing):
lines = thing.split("\n")
self._write("str", len(lines))
for line in lines:
print(line, file=self.writer)
LoadObjects.load
combines dynamic dispatch with
the string handling of our original load
function:
class LoadObjects:
def __init__(self, reader):
self.reader = reader
def load(self):
line = self.reader.readline()[:-1]
assert line, "Nothing to read"
fields = line.split(":", maxsplit=1)
assert len(fields) == 2, f"Badly-formed line {line}"
key, value = fields
method = f"load_{key}"
assert hasattr(self, method), f"Unknown object type {key}"
return getattr(self, method)(value)
The methods that load individual items are even simpler. For example, we load a floating-point number like this:
def load_float(self, value):
return float(value)
Aliasing
Consider the two lines of code below,
which created the data structure shown in Figure 16.2.
If we save this structure and then reload it
using what we have built so far,
we will wind up with two copies of the list containing the string "content"
instead of one.
This won’t be a problem if we only ever read the reloaded data,
but if we modify the new copy of fixture[0]
,
we won’t see that change reflected in fixture[1]
,
where we would have seen the change in the original data structure:
shared = ["content"]
fixture = [shared, shared]
The problem is that the list shared
is aliased,
i.e.,
there are two or more references to it.
To reconstruct the original data correctly,
we need to:
-
keep track of everything we have saved;
-
save a marker instead of the object itself when we try to save it a second time; and
-
reverse this process when loading data.
We can keep track of the things we have saved
using Python’s built-in id
function,
which returns a unique ID for every object in the program.
For example,
even if two lists contain exactly the same values,
id
will report different IDs for those lists
because they’re stored in different locations in memory.
We can use this to:
-
store the IDs of all the objects we’ve already saved in a set, and then
-
write a special entry with the keyword
alias
and its unique ID when we see an object for the second time.
Here’s the start of SaveAlias
:
class SaveAlias(SaveObjects):
def __init__(self, writer):
super().__init__(writer)
self.seen = set()
def save(self, thing):
thing_id = id(thing)
if thing_id in self.seen:
self._write("alias", thing_id, "")
return
self.seen.add(id(thing))
typename = type(thing).__name__
method = f"save_{typename}"
assert hasattr(self, method), f"Unknown object type {typename}"
getattr(self, method)(thing)
Its constructor creates an empty set of IDs seen so far.
If SaveAlias.save
notices that the object it’s about to save
has been saved before,
it writes a line like this:
alias:12345678:
where 12345678
is the object’s ID.
(The exercises will ask why the trailing colon needs to be there.)
If the object hasn’t been seen before,
SaveAlias
saves the object’s type,
its ID,
and either its value or its length:
def save_list(self, thing):
self._write("list", id(thing), len(thing))
for item in thing:
self.save(item)
SaveAlias._list
is a little different from SaveObjects._list
because it has to save each object’s identifier
along with its type and its value or length.
Our LoadAlias
class needs a similar change compared to LoadObjects
.
The first version is shown below;
as we will see,
it contains a subtle bug:
class LoadAlias(LoadObjects):
def __init__(self, reader):
super().__init__(reader)
self.seen = {}
def load(self):
line = self.reader.readline()[:-1]
assert line, "Nothing to read"
fields = line.split(":", maxsplit=2)
assert len(fields) == 3, f"Badly-formed line {line}"
key, ident, value = fields
# the lines below contain a bug
if key == "alias":
assert ident in self.seen
return self.seen[ident]
method = f"load_{key}"
assert hasattr(self, method), f"Unknown object type {key}"
result = getattr(self, method)(value)
self.seen[ident] = result
return result
The first test of our new code is:
def test_aliasing_no_aliasing():
fixture = ["a", {"b": True, 7: {"c": "d"}}]
assert roundtrip(fixture) == fixture
which uses this helper function:
def roundtrip(fixture):
writer = StringIO()
SaveAlias(writer).save(fixture)
reader = StringIO(writer.getvalue())
return LoadAlias(reader).load()
There isn’t any aliasing in the test case, but that’s deliberate: we want to make sure we haven’t broken code that was working before we move on. Here’s a test that actually includes some aliasing:
def test_aliasing_shared_child():
shared = ["content"]
fixture = [shared, shared]
result = roundtrip(fixture)
assert result == fixture
assert id(result[0]) == id(result[1])
result[0][0] = "changed"
assert result[1][0] == "changed"
It checks that the aliased sub-list is actually aliased after the data is restored, then checks that changes to the sub-list through one alias show up through the other. The second check ought to be redundant, but it’s still comforting.
There’s one more case to check, and unfortunately it reveals a bug. The two lines:
fixture = []
fixture.append(fixture)
create the data structure shown in Figure 16.3,
in which an object contains a reference to itself.
Our code ought to handle this case but doesn’t:
when we try to read in the saved data,
LoadAlias.load
sees the alias
line
but then says it can’t find the object being referred to.
The problem is these lines in LoadAlias.load
marked as containing a bug,
in combination with these lines inherited from LoadObjects
:
def load_list(self, value):
return [self.load() for _ in range(int(value))]
Let’s trace execution for the saved data:
list:4484025600:1
alias:4484025600:
-
The first line tells us that there’s a list whose ID is
4484025600
so weLoadObjects._list
to load a list of one element. -
LoadObjects._list
calledLoadAlias.load
recursively to load that one element. -
LoadAlias.load
reads the second line of saved data, which tells it to re-use the data whose ID is4484025600
. ButLoadObjects._list
hasn’t created that list yet—it is still reading the elements—soLoadAlias.load
hasn’t added the list toseen
.
The solution is to reorder the operations,
which unfortunately means writing new versions
of all the methods defined in LoadObjects
.
The new implementation of _list
is:
def load_list(self, ident, length):
result = []
self.seen[ident] = result
for _ in range(int(length)):
result.append(self.load())
return result
This method creates the list it’s going to return,
adds that list to the seen
dictionary immediately,
and then loads list items recursively.
We have to pass it the ID of the list
to use as the key in seen
,
and we have to use a loop rather than
a list comprehension,
but the changes to save_set
and save_dict
follow exactly the same pattern.
word = "word"
child = [word, word]
parent = []
parent.append(parent)
parent.append(child)
saver = SaveAlias(sys.stdout)
saver.save(parent)
list:4539747200:2
alias:4539747200:
list:4539552960:2
str:4539552048:1
word
alias:4539552048:
Summary
Figure 16.4 summarizes the ideas introduced in this chapter, while Appendix B shows how to extend our framework to handle user-defined classes.
Exercises
Dangling Colon
Why is there a colon at the end of the line alias:12345678:
when we create an alias marker?
Versioning
We now have several versions of our data storage format. Early versions of our code can’t read the archives created by later ones, and later ones can’t read the archives created early on (which used two fields per line rather than three). This problem comes up all the time in long-lived libraries and applications, and the usual solution is to include some sort of version marker at the start of each archive to indicate what version of the software created it (and therefore how it should be read). Modify the code we have written so far to do this.
Strings
Modify the framework so that strings are stored using escape characters like \n
instead of being split across several lines.
Who Calculates?
Why doesn’t LoadAlias.load
calculate object IDs?
Why does it use the IDs saved in the archive instead?
Using Globals
The lesson on unit testing introduced the function globals
,
which can be used to look up everything defined at the top level of a program.
-
Modify the persistence framework so that it looks for
save_
andload_
functions usingglobals
. -
Why is this a bad idea?