A File Archiver

Version control tools use hashing to uniquely identify each saved file.
Each snapshot of a set of files is recorded in a manifest.
Using a mock filesystem for testing is safer and faster than using the real thing.
Operations involving multiple files may suffer from race conditions.
Use a base class to specify what a component must be able to do and derive child classes to implement those operations.

Terms defined: atomic operation, base class, data migration, compression (of file), file locking, helper function, manifest, race condition, successive refinement, timestamp, time of check - time of use, top-down design, Coordinated Universal Time, version control system

We’ve written almost a thousand lines of Python so far. We could recreate it if we had to, but we’d rather not have to. We’d also like to be able to see what we’ve changed and to collaborate with other people.

A version control system like Git solves all of these problems at once. It keeps track of changes to files so that we can see what we’ve changed, recover old versions, and merge our changes with those made by other people.

The core of a modern version control tool is a way to archive files that:

records which versions of which files existed at the same time, so that we can go back to a consistent previous state, and
stores any particular version of a file only once, so that we don’t waste disk space.

This chapter builds a tool that does both tasks. It won’t create and merge branches; if you would like to see how that works, please see Mary Rose Cook’s Gitlet or Thibault Polge’s Write yourself a Git.

Saving Files

Many files only change occasionally after they’re created, or not at all. It would be wasteful for a version control system to make copies each time the user saved a snapshot of a project, so instead we will copy each unique file to something like abcd1234.bck, where abcd1234 is the hash of the file’s contents (Chapter 3). We will then record the filenames and hash keys in each snapshot: The hash keys tell us which unique files existed at the time of the snapshot, while the filenames tell us what the file’s contents were named when the snapshot was made. To restore a particular snapshot, we will copy the .bck files back to their original locations (Figure 10.1).

Backup file storage — Figure 10.1: Organization of backup file storage.

The first step is to find all the files in or below a given directory that we need to save. As described in Chapter 4, Python’s glob module can do this for us. Let’s use this to create a table of files and hashes:

HASH_LEN = 16

def hash_all(root):
    result = []
    for name in glob("**/*.*", root_dir=root, recursive=True):
        full_name = Path(root, name)
        with open(full_name, "rb") as reader:
            data = reader.read()
            hash_code = sha256(data).hexdigest()[:HASH_LEN]
            result.append((name, hash_code))
    return result

Notice that we’re truncating the hash code of each file to just 16 hexadecimal digits. This greatly increases the odds of collision, so real version control systems don’t do this, but it makes our program’s output easier to show on screen. For example, if our test directory looks like this:

sample_dir
|-- a.txt
|-- b.txt
`-- sub_dir
    `-- c.txt

1 directory, 3 files

then our program’s output is:

python hash_all.py sample_dir

filename,hash
b.txt,3cf9a1a81f6bdeaf
a.txt,17e682f060b5f8e4
sub_dir/c.txt,5695d82a086b6779

Testing

Before we go any further we need to figure out how we’re going to test our code. The obvious approach is to create directories and sub-directories containing some files we can use as fixtures. However, we are going to change or delete those files as we back things up and restore them. To make sure early tests don’t contaminate later ones, we would have to recreate those files and directories after each test.

As discussed in Chapter 9, a better approach is to use a mock object instead of the real filesystem. The pyfakefs module replaces key functions like open with functions that behave the same way but act on “files” stored in memory (Figure 10.2). Using it prevents our tests from accidentally disturbing the filesystem; it also makes tests much faster since in-memory operations are thousands of times faster than ones that touch the disk.

Mock filesystem — Figure 10.2: Using a mock filesystem to simplify testing.

If we import pyfakefs, we automatically get a fixture called fs that we can use to create files. We tell pytest we want to use this fixture by passing it as an argument to our testing function:

from pathlib import Path

def test_simple_example(fs):
    sentence = "This file contains one sentence."
    with open("alpha.txt", "w") as writer:
        writer.write(sentence)
    assert Path("alpha.txt").exists()
    with open("alpha.txt", "r") as reader:
        assert reader.read() == sentence

We can use fs to create more complicated fixtures of our own with multiple directories and files:

from pathlib import Path
import pytest

@pytest.fixture
def our_fs(fs):
    fs.create_file("a.txt", contents="aaa")
    fs.create_file("b.txt", contents="bbb")
    fs.create_file("sub_dir/c.txt", contents="ccc")

def test_nested_example(our_fs):
    assert Path("a.txt").exists()
    assert Path("b.txt").exists()
    assert Path("sub_dir/c.txt").exists()

def test_deletion_example(our_fs):
    assert Path("a.txt").exists()
    Path("a.txt").unlink()
    assert not Path("a.txt").exists()

and then test that hash_all finds all the files:

import pytest

from hash_all import hash_all, HASH_LEN

@pytest.fixture
def our_fs(fs):
    fs.create_file("a.txt", contents="aaa")
    fs.create_file("b.txt", contents="bbb")
    fs.create_file("sub_dir/c.txt", contents="ccc")

def test_hashing(our_fs):
    result = hash_all(".")
    expected = {"a.txt", "b.txt", "sub_dir/c.txt"}
    assert {r[0] for r in result} == expected
    assert all(len(r[1]) == HASH_LEN for r in result)

# [change]
def test_change(our_fs):
    original = hash_all(".")
    original = [entry for entry in original if entry[0] == "a.txt"][0]
    with open("a.txt", "w") as writer:
        writer.write("this is new content for a.txt")
    changed = hash_all(".")
    changed = [entry for entry in changed if entry[0] == "a.txt"][0]
    assert original != changed
# [/change]

and that hashes change when files change:

def test_change(our_fs):
    original = hash_all(".")
    original = [entry for entry in original if entry[0] == "a.txt"][0]
    with open("a.txt", "w") as writer:
        writer.write("this is new content for a.txt")
    changed = hash_all(".")
    changed = [entry for entry in changed if entry[0] == "a.txt"][0]
    assert original != changed

Tracking Backups

The second part of our backup tool keeps track of which files have and haven’t been backed up already. It stores backups in a directory that contains files like abcd1234.bck (the hash followed by .bck) and creates a manifest that describes the content of each snapshot. A real system would support remote storage as well so that losing one hard drive wouldn’t mean losing all our work, so we need to design our system with multiple back ends in mind.

For now, we will store manifests in CSV files named ssssssssss.csv, where ssssssssss is the UTC timestamp of the backup’s creation.

Time of Check/Time of Use

Our naming convention for manifests will fail if we try to create two or more backups in the same second. This might seem unlikely, but many faults and security holes are the result of programmers assuming things weren’t going to happen.

We could try to avoid this problem by using a two-part naming scheme ssssssss-a.csv, ssssssss-b.csv, and so on, but this leads to a race condition called time of check/time of use. If two users run the backup tool at the same time, they will both see that there isn’t a file (yet) with the current timestamp, so they will both try to create the first one. Ensuring that multi-file updates are atomic operations (i.e., that they always appear to be a single indivisible step) is a hard problem; file locking is a common approach, but complete solutions are out of the scope of this book.

This function creates a backup—or rather, it will once we fill in all the functions it depends on:

def backup(source_dir, backup_dir):
    manifest = hash_all(source_dir)
    timestamp = current_time()
    write_manifest(backup_dir, timestamp, manifest)
    copy_files(source_dir, backup_dir, manifest)
    return manifest

Writing a high-level function first and then filling in the things it needs is called successive refinement or top-down design. In practice, nobody designs code and then implements the design without changes unless they have solved closely-related problems before [Petre2016]. Instead, good programmers jump back and forth between higher and lower levels of design, adjusting their overall strategy as work on low-level details reveals problems or opportunities they hadn’t foreseen.

When writing the manifest, we check that the backup directory exists, create it if it does not, and then save the manifest as CSV:

def write_manifest(backup_dir, timestamp, manifest):
    backup_dir = Path(backup_dir)
    if not backup_dir.exists():
        backup_dir.mkdir()
    manifest_file = Path(backup_dir, f"{timestamp}.csv")
    with open(manifest_file, "w") as raw:
        writer = csv.writer(raw)
        writer.writerow(["filename", "hash"])
        writer.writerows(manifest)

We then copy those files that haven’t already been saved:

def copy_files(source_dir, backup_dir, manifest):
    for (filename, hash_code) in manifest:
        source_path = Path(source_dir, filename)
        backup_path = Path(backup_dir, f"{hash_code}.bck")
        if not backup_path.exists():
            shutil.copy(source_path, backup_path)

We have introduced several more race conditions here: for example, if two people are creating backups at the same time, they could both discover that the backup directory doesn’t exist and then both try to create it. Whoever does so first will succeed, but whoever comes second will fail. We will look at ways to fix this in the exercises as well.

What Time Is It?

Our backup function relies on a helper function called current_time that does nothing but call time.time from Python’s standard library:

def current_time():
    return f"{time.time()}".split(".")[0]

We could call time.time directly, but wrapping it up like this makes it easier to replace with a mock for testing.

Let’s do one test with real files:

BACKUPS=/tmp/backups
rm -rf $BACKUPS
python backup.py sample_dir $BACKUPS
tree --charset ascii $BACKUPS

/tmp/backups
|-- 1695482691.csv
|-- 17e682f060b5f8e4.bck
|-- 3cf9a1a81f6bdeaf.bck
`-- 5695d82a086b6779.bck

0 directories, 4 files

The rest of our tests use a fake filesystem and a mock replacement for the current_time function (so that we know what the manifest file will be called). The setup is:

FILES = {"a.txt": "aaa", "b.txt": "bbb", "sub_dir/c.txt": "ccc"}

@pytest.fixture
def our_fs(fs):
    for name, contents in FILES.items():
        fs.create_file(name, contents=contents)

and an example of a single test is:

def test_nested_example(our_fs):
    timestamp = 1234
    with patch("backup.current_time", return_value=timestamp):
        manifest = backup(".", "/backup")
    assert Path("/backup", f"{timestamp}.csv").exists()
    for filename, hash_code in manifest:
        assert Path("/backup", f"{hash_code}.bck").exists()

Refactoring

Now that we have a better idea of what we’re doing, we can refactor to create a base class that prescribes the general steps in creating a backup:

class Archive:
    def __init__(self, source_dir):
        self._source_dir = source_dir

    def backup(self):
        manifest = hash_all(self._source_dir)
        self._write_manifest(manifest)
        self._copy_files(manifest)
        return manifest

We can then derive a child class to archive things locally and fill in its methods by re-using code from the functions we have just written. Once we’ve done this, we can create the specific archiver we want with a single line:

    archiver = ArchiveLocal(source_dir, backup_dir)

Doing this makes life easier when we want to write archivers that behave the same way but work differently. For example, we could create an archiver that compresses the files it archives by deriving a new class from ArchiveLocal and writing a new _copy_files method. More importantly, other code can use an archiver without knowing what it’s doing. For example, the function analyze_and_save reads some data, analyzes it, saves the results, and then creates an archive of those results. It doesn’t know whether the archive is compressing files or whether they’re being saved locally or remotely.

def analyze_and_save(options, archiver):
    data = read_data(options)
    results = analyze_data(data)
    save_everything(results)
    archiver.backup()

This example highlights one of the strengths of object-oriented programming: it allows old code to use new code without any changes.

Summary

Figure 10.3 summarizes the key ideas in this chapter, which are the foundation of most modern tools for doing backups and version control.

Concept map of file backup — Figure 10.3: Concept map for hashing-based file backup.

Exercises

Sequencing Backups

Modify the backup program so that manifests are numbered sequentially as 00000001.csv, 00000002.csv, and so on rather than being timestamped. Why doesn’t this solve the time of check/time of use race condition mentioned earlier?

JSON Manifests

Modify backup.py so that it can save JSON manifests as well as CSV manifests based on a command-line flag.
Write another program called migrate.py that converts a set of manifests from CSV to JSON. (The program’s name comes from the term data migration.)
Modify backup.py programs so that each manifest stores the user name of the person who created it along with file hashes, and then modify migrate.py to transform old files into the new format.

Mock Hashes

Modify the file backup program so that it uses a function called ourHash to hash files.
Create a replacement that returns some predictable value, such as the first few characters of the data.
Rewrite the tests to use this function. How did you modify the main program so that the tests could control which hashing function is used?

Comparing Manifests

Write a program compare-manifests.py that reads two manifest files and reports:

Which files have the same names but different hashes (i.e., their contents have changed).
Which files have the same hashes but different names (i.e., they have been renamed).
Which files are in the first hash but neither their names nor their hashes are in the second (i.e., they have been deleted).
Which files are in the second hash but neither their names nor their hashes are in the first (i.e., they have been added).

From One State to Another

Write a program called from_to.py that takes a directory and a manifest file as command-line arguments, then adds, removes, and/or renames files in the directory to restore the state described in the manifest. The program should only perform file operations when it needs to, e.g., it should not delete a file and re-add it if the contents have not changed.
Write some tests for from_to.py using pytest and a mock filesystem.

File History

Write a program called file_history.py that takes the name of a file as a command-line argument and displays the history of that file by tracing it back in time through the available manifests.
Write tests for your program using pytest and a mock filesystem.

Pre-commit Hooks

Modify backup.py to load and run a function called pre_commit from a file called pre_commit.py stored in the root directory of the files being backed up. If pre_commit returns True, the backup proceeds; if it returns False or raises an exception, no backup is created.