A File Archiver
- Version control tools use hashing to uniquely identify each saved file.
- Each snapshot of a set of files is recorded in a manifest.
- Using a mock filesystem for testing is safer and faster than using the real thing.
- Operations involving multiple files may suffer from race conditions.
- Use a base class to specify what a component must be able to do and derive child classes to implement those operations.
Terms defined: atomic operation, base class, data migration, compression (of file), file locking, helper function, manifest, race condition, successive refinement, timestamp, time of check - time of use, top-down design, Coordinated Universal Time, version control system
We’ve written almost a thousand lines of Python so far. We could recreate it if we had to, but we’d rather not have to. We’d also like to be able to see what we’ve changed and to collaborate with other people.
A version control system like Git solves all of these problems at once. It keeps track of changes to files so that we can see what we’ve changed, recover old versions, and merge our changes with those made by other people.
The core of a modern version control tool is a way to archive files that:
-
records which versions of which files existed at the same time, so that we can go back to a consistent previous state, and
-
stores any particular version of a file only once, so that we don’t waste disk space.
This chapter builds a tool that does both tasks. It won’t create and merge branches; if you would like to see how that works, please see Mary Rose Cook’s Gitlet or Thibault Polge’s Write yourself a Git.
Saving Files
Many files only change occasionally after they’re created, or not at all.
It would be wasteful for a version control system to make copies
each time the user saved a snapshot of a project,
so instead we will copy each unique file to something like abcd1234.bck
,
where abcd1234
is the hash of the file’s contents (Chapter 3).
We will then record the filenames and hash keys in each snapshot:
The hash keys tell us which unique files existed at the time of the snapshot,
while the filenames tell us what the file’s contents were named when the snapshot was made.
To restore a particular snapshot,
we will copy the .bck
files back to their original locations
(Figure 10.1).
The first step is to find all the files in or below a given directory
that we need to save.
As described in Chapter 4,
Python’s glob
module can do this for us.
Let’s use this to create a table of files and hashes:
HASH_LEN = 16
def hash_all(root):
result = []
for name in glob("**/*.*", root_dir=root, recursive=True):
full_name = Path(root, name)
with open(full_name, "rb") as reader:
data = reader.read()
hash_code = sha256(data).hexdigest()[:HASH_LEN]
result.append((name, hash_code))
return result
Notice that we’re truncating the hash code of each file to just 16 hexadecimal digits. This greatly increases the odds of collision, so real version control systems don’t do this, but it makes our program’s output easier to show on screen. For example, if our test directory looks like this:
sample_dir
|-- a.txt
|-- b.txt
`-- sub_dir
`-- c.txt
1 directory, 3 files
then our program’s output is:
python hash_all.py sample_dir
filename,hash
b.txt,3cf9a1a81f6bdeaf
a.txt,17e682f060b5f8e4
sub_dir/c.txt,5695d82a086b6779
Testing
Before we go any further we need to figure out how we’re going to test our code. The obvious approach is to create directories and sub-directories containing some files we can use as fixtures. However, we are going to change or delete those files as we back things up and restore them. To make sure early tests don’t contaminate later ones, we would have to recreate those files and directories after each test.
As discussed in Chapter 9,
a better approach is to use a mock object
instead of the real filesystem.
The pyfakefs
module replaces key functions like open
with functions that behave the same way
but act on “files” stored in memory
(Figure 10.2).
Using it prevents our tests from accidentally disturbing the filesystem;
it also makes tests much faster
since in-memory operations are thousands of times faster than ones that touch the disk.
If we import pyfakefs
,
we automatically get a fixture called fs
that we can use to create files.
We tell pytest we want to use this fixture
by passing it as an argument to our testing function:
from pathlib import Path
def test_simple_example(fs):
sentence = "This file contains one sentence."
with open("alpha.txt", "w") as writer:
writer.write(sentence)
assert Path("alpha.txt").exists()
with open("alpha.txt", "r") as reader:
assert reader.read() == sentence
We can use fs
to create more complicated fixtures of our own
with multiple directories and files:
from pathlib import Path
import pytest
@pytest.fixture
def our_fs(fs):
fs.create_file("a.txt", contents="aaa")
fs.create_file("b.txt", contents="bbb")
fs.create_file("sub_dir/c.txt", contents="ccc")
def test_nested_example(our_fs):
assert Path("a.txt").exists()
assert Path("b.txt").exists()
assert Path("sub_dir/c.txt").exists()
def test_deletion_example(our_fs):
assert Path("a.txt").exists()
Path("a.txt").unlink()
assert not Path("a.txt").exists()
and then test that hash_all
finds all the files:
import pytest
from hash_all import hash_all, HASH_LEN
@pytest.fixture
def our_fs(fs):
fs.create_file("a.txt", contents="aaa")
fs.create_file("b.txt", contents="bbb")
fs.create_file("sub_dir/c.txt", contents="ccc")
def test_hashing(our_fs):
result = hash_all(".")
expected = {"a.txt", "b.txt", "sub_dir/c.txt"}
assert {r[0] for r in result} == expected
assert all(len(r[1]) == HASH_LEN for r in result)
# [change]
def test_change(our_fs):
original = hash_all(".")
original = [entry for entry in original if entry[0] == "a.txt"][0]
with open("a.txt", "w") as writer:
writer.write("this is new content for a.txt")
changed = hash_all(".")
changed = [entry for entry in changed if entry[0] == "a.txt"][0]
assert original != changed
# [/change]
and that hashes change when files change:
def test_change(our_fs):
original = hash_all(".")
original = [entry for entry in original if entry[0] == "a.txt"][0]
with open("a.txt", "w") as writer:
writer.write("this is new content for a.txt")
changed = hash_all(".")
changed = [entry for entry in changed if entry[0] == "a.txt"][0]
assert original != changed
Tracking Backups
The second part of our backup tool
keeps track of which files have and haven’t been backed up already.
It stores backups in a directory that contains files like abcd1234.bck
(the hash followed by .bck
)
and creates a manifest
that describes the content of each snapshot.
A real system would support remote storage as well
so that losing one hard drive wouldn’t mean losing all our work,
so we need to design our system with multiple back ends in mind.
For now,
we will store manifests in CSV files named ssssssssss.csv
,
where ssssssssss
is the UTC timestamp
of the backup’s creation.
Time of Check/Time of Use
Our naming convention for manifests will fail if we try to create two or more backups in the same second. This might seem unlikely, but many faults and security holes are the result of programmers assuming things weren’t going to happen.
We could try to avoid this problem by using a two-part naming scheme ssssssss-a.csv
,
ssssssss-b.csv
, and so on,
but this leads to a race condition
called time of check/time of use.
If two users run the backup tool at the same time,
they will both see that there isn’t a file (yet) with the current timestamp,
so they will both try to create the first one.
Ensuring that multi-file updates are atomic operations
(i.e., that they always appear to be a single indivisible step)
is a hard problem;
file locking is a common approach,
but complete solutions are out of the scope of this book.
This function creates a backup—or rather, it will once we fill in all the functions it depends on:
def backup(source_dir, backup_dir):
manifest = hash_all(source_dir)
timestamp = current_time()
write_manifest(backup_dir, timestamp, manifest)
copy_files(source_dir, backup_dir, manifest)
return manifest
Writing a high-level function first and then filling in the things it needs is called successive refinement or top-down design. In practice, nobody designs code and then implements the design without changes unless they have solved closely-related problems before [Petre2016]. Instead, good programmers jump back and forth between higher and lower levels of design, adjusting their overall strategy as work on low-level details reveals problems or opportunities they hadn’t foreseen.
When writing the manifest, we check that the backup directory exists, create it if it does not, and then save the manifest as CSV:
def write_manifest(backup_dir, timestamp, manifest):
backup_dir = Path(backup_dir)
if not backup_dir.exists():
backup_dir.mkdir()
manifest_file = Path(backup_dir, f"{timestamp}.csv")
with open(manifest_file, "w") as raw:
writer = csv.writer(raw)
writer.writerow(["filename", "hash"])
writer.writerows(manifest)
We then copy those files that haven’t already been saved:
def copy_files(source_dir, backup_dir, manifest):
for (filename, hash_code) in manifest:
source_path = Path(source_dir, filename)
backup_path = Path(backup_dir, f"{hash_code}.bck")
if not backup_path.exists():
shutil.copy(source_path, backup_path)
We have introduced several more race conditions here: for example, if two people are creating backups at the same time, they could both discover that the backup directory doesn’t exist and then both try to create it. Whoever does so first will succeed, but whoever comes second will fail. We will look at ways to fix this in the exercises as well.
What Time Is It?
Our backup
function relies on a helper function
called current_time
that does nothing but call time.time
from
Python’s standard library:
def current_time():
return f"{time.time()}".split(".")[0]
We could call time.time
directly,
but wrapping it up like this makes it easier to replace with a mock for testing.
Let’s do one test with real files:
BACKUPS=/tmp/backups
rm -rf $BACKUPS
python backup.py sample_dir $BACKUPS
tree --charset ascii $BACKUPS
/tmp/backups
|-- 1695482691.csv
|-- 17e682f060b5f8e4.bck
|-- 3cf9a1a81f6bdeaf.bck
`-- 5695d82a086b6779.bck
0 directories, 4 files
The rest of our tests use a fake filesystem
and a mock replacement for the current_time
function
(so that we know what the manifest file will be called).
The setup is:
FILES = {"a.txt": "aaa", "b.txt": "bbb", "sub_dir/c.txt": "ccc"}
@pytest.fixture
def our_fs(fs):
for name, contents in FILES.items():
fs.create_file(name, contents=contents)
and an example of a single test is:
def test_nested_example(our_fs):
timestamp = 1234
with patch("backup.current_time", return_value=timestamp):
manifest = backup(".", "/backup")
assert Path("/backup", f"{timestamp}.csv").exists()
for filename, hash_code in manifest:
assert Path("/backup", f"{hash_code}.bck").exists()
Refactoring
Now that we have a better idea of what we’re doing, we can refactor to create a base class that prescribes the general steps in creating a backup:
class Archive:
def __init__(self, source_dir):
self._source_dir = source_dir
def backup(self):
manifest = hash_all(self._source_dir)
self._write_manifest(manifest)
self._copy_files(manifest)
return manifest
We can then derive a child class to archive things locally and fill in its methods by re-using code from the functions we have just written. Once we’ve done this, we can create the specific archiver we want with a single line:
archiver = ArchiveLocal(source_dir, backup_dir)
Doing this makes life easier when we want to write archivers
that behave the same way but work differently.
For example,
we could create an archiver that compresses
the files it archives
by deriving a new class from ArchiveLocal
and writing a new _copy_files
method.
More importantly,
other code can use an archiver without knowing what it’s doing.
For example,
the function analyze_and_save
reads some data,
analyzes it,
saves the results,
and then creates an archive of those results.
It doesn’t know whether the archive is compressing files
or whether they’re being saved locally or remotely.
def analyze_and_save(options, archiver):
data = read_data(options)
results = analyze_data(data)
save_everything(results)
archiver.backup()
This example highlights one of the strengths of object-oriented programming: it allows old code to use new code without any changes.
Summary
Figure 10.3 summarizes the key ideas in this chapter, which are the foundation of most modern tools for doing backups and version control.
Exercises
Sequencing Backups
Modify the backup program so that manifests are numbered sequentially
as 00000001.csv
, 00000002.csv
, and so on
rather than being timestamped.
Why doesn’t this solve the time of check/time of use race condition mentioned earlier?
JSON Manifests
-
Modify
backup.py
so that it can save JSON manifests as well as CSV manifests based on a command-line flag. -
Write another program called
migrate.py
that converts a set of manifests from CSV to JSON. (The program’s name comes from the term data migration.) -
Modify
backup.py
programs so that each manifest stores the user name of the person who created it along with file hashes, and then modifymigrate.py
to transform old files into the new format.
Mock Hashes
-
Modify the file backup program so that it uses a function called
ourHash
to hash files. -
Create a replacement that returns some predictable value, such as the first few characters of the data.
-
Rewrite the tests to use this function. How did you modify the main program so that the tests could control which hashing function is used?
Comparing Manifests
Write a program compare-manifests.py
that reads two manifest files and reports:
-
Which files have the same names but different hashes (i.e., their contents have changed).
-
Which files have the same hashes but different names (i.e., they have been renamed).
-
Which files are in the first hash but neither their names nor their hashes are in the second (i.e., they have been deleted).
-
Which files are in the second hash but neither their names nor their hashes are in the first (i.e., they have been added).
From One State to Another
-
Write a program called
from_to.py
that takes a directory and a manifest file as command-line arguments, then adds, removes, and/or renames files in the directory to restore the state described in the manifest. The program should only perform file operations when it needs to, e.g., it should not delete a file and re-add it if the contents have not changed. -
Write some tests for
from_to.py
using pytest and a mock filesystem.
File History
-
Write a program called
file_history.py
that takes the name of a file as a command-line argument and displays the history of that file by tracing it back in time through the available manifests. -
Write tests for your program using pytest and a mock filesystem.
Pre-commit Hooks
Modify backup.py
to load and run a function called pre_commit
from a file called pre_commit.py
stored in the root directory of the files being backed up.
If pre_commit
returns True
, the backup proceeds;
if it returns False
or raises an exception,
no backup is created.