A File Cache
- Software systems often use caches to store a subset of files in order to use less disk space.
- Caching systems can replace actual files with placeholders containing metadata.
- Object-oriented systems are often implemented in stages to break large design problems into smaller, more manageable ones.
- In a good design, derived classes only have to override a few (preferably none) of the methods implemented in parent classes.
- Implementing a minimum testable class allows early testing of core functionality.
Terms defined: authentication, minimum testable class, named tuple, placeholder file
Data scientists often want to analyze the same files in several projects. Those files might be too sensitive or too large to store in version control. Even when that’s not the case, storing several copies eventually results in someone analyzing an out-of-date version of the data.
Tools like Git LFS and DVC offer an alternative: they replace the actual data file with a placeholder file that contains a unique ID, such as a hash of the data file's contents (Figure 31.1). The placeholder file is small, so it can be committed to version control, and if the data file it refers to changes, that change will be visible in the version control history.
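For example, a Git LFS pointer file is just a few lines of text recording the hash and size of the real file (the hash and size below are made up):

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345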
The data file is stored on a server or in the cloud and downloaded on demand. Since the data file might be shared between several projects, it is usually not downloaded into any particular project. Instead, it is stored in a cache that all projects have access to. If the cache gets too large, files in it can be deleted and re-downloaded later.
This architecture has four parts: the permanent archive of data files, an index of all the files it contains, the cache of locally-available files, and the placeholder files that are actually saved in version control. There are several ways to implement each of these, so our implementation will give us a chance to explore some ideas in object-oriented programming.
A File Index
Let’s start by creating the index of archived files. We need to record each file’s unique identifier and the time it was added to the archive; we could define a class for this, but since instances of that class are just passive data records, we use a named tuple instead:
from collections import namedtuple
CacheEntry = namedtuple("CacheEntry", ["identifier", "timestamp"])
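Instances of a named tuple behave like ordinary tuples, but their fields can also be read by name; a quick sketch (a real entry stores a datetime rather than a string):

entry = CacheEntry("abcd1234", "2022-12-12:09:30:00")
print(entry.identifier)        # abcd1234
identifier, timestamp = entry  # still unpacks like a plain tuple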
Next, we create an abstract class that defines the operations an index can do but doesn't actually implement them all. More precisely, it implements only the operations that can be defined in terms of lower-level methods that derived classes will provide:
import datetime
from abc import ABC, abstractmethod


# Not shown in the original excerpt: the exception type used throughout.
class CacheException(Exception):
    """Signal problems with the cache or its index."""


class IndexBase(ABC):
    TIME_FORMAT = "%Y-%m-%d:%H:%M:%S"

    def __init__(self, index_dir=None):
        self.set_index_dir(index_dir)

    def get_index_dir(self):
        return self.index_dir

    def set_index_dir(self, index_dir):
        self.index_dir = index_dir
        self._initialize_index()

    def has(self, identifier):
        index = self.load()
        return any(entry.identifier == identifier for entry in index)

    def known(self):
        return {entry.identifier for entry in self.load()}

    def add(self, identifier):
        index = self.load()
        entry = CacheEntry(identifier, current_time())
        index.append(entry)
        self.save(index)
The methods of IndexBase can:
- get and set the directory where the index file is stored;
- check if a particular file is known (using its identifier, not its name);
- get a list of all known files (again, by identifier rather than name); and
- add an identifier to the index with a timestamp.
These methods rely on three abstract methods to initialize the index if it doesn’t already exist, load the index, and save the index. Their definitions are:
@abstractmethod
def load(self):
    """Load entire index."""

@abstractmethod
def save(self, index):
    """Save entire index."""

@abstractmethod
def _initialize_index(self):
    """Initialize index file if necessary."""
Finally, we define a function current_time that returns a time we can use in an index record. Doing this gives us something that will be easy to mock for testing:
def current_time():
    return datetime.datetime.now()
Naming Things
The index stores files’ unique IDs rather than their names because the latter are unreliable. A file may be moved or renamed several times within a project, or may have different names in different projects; our design therefore relies on something intrinsic to the file rather than something under the user’s control.
The price for this decision is that files' "names" are meaningless. Whether we use sequence numbers or hashes, remembering that health_records.csv is file 177234 increases the cognitive load on our users.
By writing IndexBase, we have turned a large, amorphous design problem into a smaller and more focused one (Figure 31.2). For example, if we decide to store the index as a JSON file or in a SQLite database, we should only need to provide the methods to handle that file format; we should not need to re-write or re-test any of the other methods. This rule gives us a way to evaluate our designs: the fewer methods we have to override, the better the original division of labor was.
To try it out, let's implement a concrete index class that stores data in a CSV file. Its load method checks that the cache directory exists and that it contains an index file, then reads that file and turns its contents into CacheEntry records:
import csv
from datetime import datetime
from pathlib import Path

from index_base import CacheEntry, CacheException, IndexBase


class IndexCSV(IndexBase):
    def load(self):
        if not self.index_dir:
            raise CacheException("Cache directory not set in index")
        index_path = self._make_index_path()
        if not index_path.exists():
            raise CacheException(f"Index file {index_path} not found")
        with open(index_path, "r") as stream:
            reader = csv.reader(stream)
            return [
                CacheEntry(r[0], datetime.strptime(r[1], self.TIME_FORMAT))
                for r in reader
            ]
Similarly, save checks that the cache directory exists and then writes each record to the index file. This method automatically creates the index file if it doesn't already exist:
def save(self, index):
    if not self.index_dir:
        raise CacheException("Cache directory not set in index")
    index_path = self._make_index_path()
    with open(index_path, "w") as stream:
        writer = csv.writer(stream)
        for entry in index:
            when = entry.timestamp.strftime(self.TIME_FORMAT)
            writer.writerow((entry.identifier, when))
Finally, we need to write _initialize_index, which is called by the set_index_dir method in the base class rather than being invoked directly by the user. While we're doing that, we'll define the name for the index file and write a helper method that produces its full path:
INDEX_FILE = "index.csv"

def _initialize_index(self):
    self._make_index_path().touch()

def _make_index_path(self):
    if not self.index_dir:
        raise CacheException("Cache directory not set in index")
    return Path(self.index_dir, self.INDEX_FILE)
Time to do some testing. As in Chapter 10, we will use pyfakefs instead of creating files on disk, and pytest's fixture decorator to set things up:
from datetime import datetime
from pathlib import Path
from unittest.mock import patch
import pytest
from index_base import CacheEntry
from index_csv import IndexCSV  # assumed module name for the class above

CACHE_DIR = Path("/cache")
INDEX_FILE = Path(CACHE_DIR, IndexCSV.INDEX_FILE)
LOCAL_DIR = Path(".")

@pytest.fixture
def disk(fs):
    fs.create_dir(CACHE_DIR)
If we create a new cache, it should be empty:
def test_csv_loads_initially(disk):
    index = IndexCSV(CACHE_DIR)
    assert index.load() == []
    assert index.known() == set()
If we add an entry, it should still be there when we reload the index:
def test_csv_saves_changes(disk):
    right_now = datetime(2022, 12, 12)
    index = IndexCSV(CACHE_DIR)
    with patch("index_base.current_time", return_value=right_now):
        index.add("abcd1234")
    assert index.load() == [CacheEntry("abcd1234", right_now)]
    assert index.known() == {"abcd1234"}
And if we check whether or not something is in the index, the answer should be “yes” for things we’ve added and “no” for things we haven’t:
def test_csv_has_entry(disk):
    right_now = datetime(2022, 12, 12)
    index = IndexCSV(CACHE_DIR)
    with patch("index_base.current_time", return_value=right_now):
        index.add("abcd1234")
    assert index.has("abcd1234")
    assert not index.has("dcba4321")
A Local Cache
Just as IndexBase defined the behavior every cache index must have, CacheBase defines the required behavior of caches themselves. As the listing below shows, it is less than three dozen lines long:
class CacheBase(ABC):
    CACHE_SUFFIX = "cache"

    def __init__(self, index, cache_dir):
        self.index = index
        self.cache_dir = cache_dir

    def add(self, local_path):
        identifier = self._make_identifier(local_path)
        self.index.add(identifier)
        self._add(identifier, local_path)
        return identifier

    def get_cache_path(self, identifier):
        if not self.index.has(identifier):
            raise CacheException(f"Unknown file identifier {identifier}")
        return self._make_cache_path(identifier)

    def has(self, identifier):
        return self.index.has(identifier)

    def known(self):
        return self.index.known()

    def _make_identifier(self, local_path):
        with open(local_path, "rb") as reader:
            return hash_stream(reader)

    def _make_cache_path(self, identifier):
        return Path(self.cache_dir, f"{identifier}.{self.CACHE_SUFFIX}")

    @abstractmethod
    def _add(self, identifier, local_path):
        """Add a file with a given identifier from a given local path."""
CacheBase encapsulates several design decisions. Cached files are named identifier.cache and are all stored in a single directory. This scheme works well if the cache only ever holds a few thousand files, but if that number might grow into the tens or hundreds of thousands, we might want to divide the cached files between several sub-directories (which is what most browsers do).
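For example, a sharded variant of _make_cache_path (a sketch, not part of the design above) could spread files across sub-directories keyed by the first two characters of the identifier, much as Git stores objects:

def _make_cache_path(self, identifier):
    # Hypothetical sharding scheme: store abcd1234.cache in <cache_dir>/ab/.
    sub_dir = Path(self.cache_dir, identifier[:2])
    sub_dir.mkdir(parents=True, exist_ok=True)
    return Path(sub_dir, f"{identifier}.{self.CACHE_SUFFIX}")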
All of CacheBase's operations are implemented in terms of an index derived from IndexBase, a helper method that makes a file identifier by hashing the file's contents, another helper method that constructs the path to a cached file, and a single abstract method called _add that actually adds a file to the cache. CacheBase doesn't rely on how the index is implemented, which means we can defer making that decision until we absolutely have to. Equally, encapsulating the specifics of file storage in a single method allows us to do most of the implementation before figuring that out.
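The excerpt doesn't show hash_stream; a minimal version, assuming SHA-256 is an acceptable choice of hash, might look like this:

import hashlib

def hash_stream(reader, block_size=65536):
    """Hash a binary stream block by block so large files fit in memory."""
    hasher = hashlib.sha256()
    while True:
        block = reader.read(block_size)
        if not block:
            break
        hasher.update(block)
    return hasher.hexdigest()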
With CacheBase in place, we can create a class called CacheFilesystem that copies files to the local cache but doesn't actually archive them anywhere:
import shutil

class CacheFilesystem(CacheBase):
    def _add(self, identifier, local_path):
        cache_path = self._make_cache_path(identifier)
        if not cache_path.exists():
            shutil.copyfile(local_path, cache_path)
A file archiving system that doesn’t actually archive files isn’t particularly useful in practice, but it enables us to write some useful tests. Developers often create a minimum testable class for this reason. Just as creating an abstract base class that implements operations in terms of a small number of methods helps us break a design problem into smaller, more manageable pieces, building a minimum testable class enables us to check some things right away so that we can focus later testing effort on implementation-specific details.
By now our testing should look familiar. We create a fixture for the filesystem as a whole and another for the cache:
CACHE_DIR = Path("/cache")

@pytest.fixture
def disk(fs):
    fs.create_dir(CACHE_DIR)
    return fs

@pytest.fixture
def cache():
    return CacheFilesystem(IndexCSV(CACHE_DIR), CACHE_DIR)
and then write a test to check that if we haven’t ever added files to the cache, no files are present:
def test_filesystem_no_files_before_add(disk, cache):
    assert cache.known() == set()
We can then start testing that (for example) if we add two files with different names, the cache contains two files:
def test_filesystem_two_files_present_after_add(disk, cache):
    names = "ab"
    for name in names:
        filename = f"{name}.txt"
        disk.create_file(filename, contents=name)
        cache.add(filename)
    assert len(cache.known()) == 2
A Limited Cache
The next step is to archive all the files remotely and only cache a few locally. We will simulate remote storage using a second directory on our own machine, and limit the number of files in the cache rather than their total size; we’ll look at a size-based cache in the exercises.
Figure It Out Later
As soon as we start storing files on remote servers, we need to figure out how to authenticate the user, i.e., how to establish that they are who they say they are. Again, building our system piece by piece lets us sort out other details now and worry about those ones later. We are taking a risk, though: it’s possible that we won’t be able to extend our design to handle those extra requirements, but will instead have to tear it down and start over. Experience is the only reliable guide; the real aim of lessons like this one is to pass on experience so that the next person doesn’t have to step on all the landmines themselves.
Our new class CacheLimited needs:
- an index to keep track of files;
- the path to the cache directory;
- the path to the archive directory that we’re using to simulate remote storage; and
- the maximum number of files the cache is allowed to contain.
Its constructor is therefore:
class CacheLimited(CacheFilesystem):
    def __init__(self, index, cache_dir, archive_dir, local_limit):
        super().__init__(index, cache_dir)
        self.archive_dir = archive_dir
        self.local_limit = local_limit
At this point we are going to do something that we said earlier we wouldn't need to do. CacheBase defines a method get_cache_path, and CacheFilesystem used that method without altering it. In contrast, our new CacheLimited class overrides it to ensure there's a local copy of a file whenever we want to read it. In doing so, get_cache_path ensures that the cache has enough space for that file. More specifically, if we add a file when the cache is full, CacheLimited will:
- delete a file from the cache;
- add the new file to the cache; and
- add it to archival storage.
def get_cache_path(self, identifier):
    cache_path = super().get_cache_path(identifier)
    if not cache_path.exists():
        self._ensure_cache_space()
        archive_path = self._make_archive_path(identifier)
        shutil.copyfile(archive_path, cache_path)
    return cache_path
The method that ensures the cache has space for a new file checks the cache’s size and removes a file if necessary:
def _ensure_cache_space(self):
    # Count only cached files, so that (e.g.) the index file is never evicted.
    cache_dir = Path(self.cache_dir)
    assert cache_dir.exists()
    cache_files = list(cache_dir.glob(f"*.{self.CACHE_SUFFIX}"))
    if len(cache_files) < self.local_limit:
        return
    # Remove a file (never mind which one) to make room.
    choice = cache_files.pop()
    assert len(cache_files) < self.local_limit
    Path(choice).unlink()
We also need to implement _add to handle the case of adding an entirely new file:
def _add(self, identifier, local_path):
    self._add_archive(identifier, local_path)
    self._ensure_cache_space()
    super()._add(identifier, local_path)
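CacheLimited also relies on two helpers that the excerpt doesn't show. Minimal versions consistent with how they are called might look like this:

def _make_archive_path(self, identifier):
    return Path(self.archive_dir, f"{identifier}.{self.CACHE_SUFFIX}")

def _add_archive(self, identifier, local_path):
    # Copy the file into (simulated) remote storage if it isn't already there.
    archive_path = self._make_archive_path(identifier)
    if not archive_path.exists():
        shutil.copyfile(local_path, archive_path)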
The tests for CacheLimited look like those we've written before:
def test_limited_single_file_present_after_add(disk, cache):
    disk.create_file("test.txt", contents="xyz")
    ident = cache.add("test.txt")
    assert cache.has(ident)
    assert cache.known() == {ident}
    cache_path = cache.get_cache_path(ident)
    assert str(cache_path.parent) == str(CACHE_DIR)
    archive_path = cache._make_archive_path(ident)
    assert Path(archive_path).exists()
One difference is that they check the archive directory as well as the cache directory. To do this, the tests may use a method like _make_archive_path that isn't intended for general consumption. Application programs should never do this, but it's generally regarded as acceptable for tests (not least because the alternative is to make that "private" method generally available).
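The cache fixture these tests rely on isn't shown in the excerpt; one plausible version, assuming a second simulated directory for the archive and a limit of two files, is:

ARCHIVE_DIR = Path("/archive")

@pytest.fixture
def disk(fs):
    fs.create_dir(CACHE_DIR)
    fs.create_dir(ARCHIVE_DIR)
    return fs

@pytest.fixture
def cache():
    return CacheLimited(IndexCSV(CACHE_DIR), CACHE_DIR, ARCHIVE_DIR, local_limit=2)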
Placeholders
The last thing we need for a fully-functional, repository-friendly file-caching system is a reliable way to create the placeholder files that take the place of actual data files in version control. Each placeholder holds a cache entry's identifier; if we do our job properly, users will never need to see or remember these.
Python provides an open function for opening files, so we will create a cache_open function to open cached files:
def cache_open(cache, filename):
    if not filename.endswith(cache.CACHE_SUFFIX):
        filename = f"{filename}.{cache.CACHE_SUFFIX}"
    filename = Path(filename)
    if not filename.exists():
        raise FileNotFoundError(f"No local cache file {filename}")
    with open(filename, "r") as reader:
        identifier = reader.read().strip()
    cache_path = cache.get_cache_path(identifier)
    return open(cache_path, "r")
If the actual file is called a.txt, this function looks for a.txt.cache, reads the file identifier from it, uses that to find the path to the cached file, and opens that file for reading. We don't allow users to open files for writing, since that would almost certainly put the file's contents out of step with its identifier. This restriction is similar to Python's requirement that dictionaries' keys be immutable; we will explore alternatives in the exercises.
cache_save is the complement to cache_open. Given the name of a local file that the user wants to cache, this function saves it in the cache and constructs a placeholder file:
def cache_save(cache, filename):
    identifier = cache.add(filename)
    identifier_file = Path(f"{filename}.{cache.CACHE_SUFFIX}")
    with open(identifier_file, "w") as writer:
        writer.write(identifier)
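Putting the pieces together, a round trip through the whole system looks something like this (a sketch that assumes the cache and archive directories already exist and that sample.txt is a real file):

index = IndexCSV("/cache")
cache = CacheLimited(index, "/cache", "/archive", local_limit=10)

cache_save(cache, "sample.txt")   # copies the file and writes sample.txt.cache
with cache_open(cache, "sample.txt") as reader:
    print(reader.read())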
These two functions work as designed, but neither should be used in production. The problem is that neither function works atomically: in both cases, if something goes wrong part-way through the function, the caching system can be left in an inconsistent state. While there’s no obvious way for these functions to fail mid-flight, there’s always the possibility of a power failure or of the user typing Control-C to interrupt the program at just the wrong spot. Tools like DVC do a lot of extra work to guarantee that this can’t happen.
Summary
Exercises
An alternative index
Create a new class IndexJSON that stores the index as a JSON file instead of as CSV. You should be able to use your new class with the existing cache classes without changing the latter.
Another alternative index
Create a new class IndexSQLite that stores the index in a SQLite database instead of as CSV. You should be able to use your new class with the existing cache classes without changing the latter.
Consolidation
Write a tool that finds all the placeholder files in a project and copies the files they refer to from the cache into the project, deleting placeholders as it does so.
Cache size
- Modify the cache so that it only stores files up to a specified total size.
- How should the cache decide which files to delete when a new file would put it over its size limit?
- What should the cache do if someone tries to add a file that is larger than the size limit?
Least recently used
- Modify the cache and index to keep track of when files are used as well as when they are created.
- Modify the cache cleanup code to delete the least recently used file when extra space is needed.
Compressing files
- Modify the system to compress files when they are archived. Use Python's zipfile module to handle compression.
- Modify the system again so that users can specify whether files should be compressed in both the archive and the cache, only compressed in the archive, or never compressed.
Reading cached files
- Modify cache_open so that files can be opened in binary mode.
- Modify it again so that the function can be used as a context manager in with statements.
Storing files remotely
- Create a small web server (Chapter 22) that accepts file upload and file download requests.
- Modify the cache so that it sends new files to the server and downloads files from the server on demand.
Commenting
- Modify the system so that users can add comments about files in the local placeholder files.
- Suppose two projects A and B are stored in separate version control repositories but use the same data files. If comments are stored in placeholder files, then comments saved in A won't be visible in B and vice versa. How could you modify your solution to part 1 to enable inter-project sharing of comments?
Appending
Modify the system so that users can append data to existing files:
- Each placeholder file stores the IDs of one or more cached files and the number of bytes in each.
- When a user wants to read from a file, they may specify a byte offset from the start of the (logical) file where reading is to start.
- When a user opens a file for appending, another chunk is created in the cache and writes are appended to it. When they close the file, an entry is added to the placeholder file with information about the new chunk.