A File Cache
- Software systems often use caches to store a subset of files in order to use less disk space.
- Caching systems can replace actual files with placeholders containing metadata.
- Object-oriented systems are often implemented in stages to break large design problems into smaller, more manageable ones.
- In a good design, derived classes only have to override a few (preferably none) of the methods implemented in parent classes.
- Implementing a minimum testable class allows early testing of core functionality.
Terms defined: authentication, minimum testable class, named tuple, placeholder file
Data scientists often want to analyze the same files in several projects. Those files might be too sensitive or too large to store in version control. Even when that’s not the case, storing several copies eventually results in someone analyzing an out-of-date version of the data.
Tools like Git LFS and DVC offer an alternative: they replace the actual data file with a placeholder file that contains a unique ID, such as a hash of the data file's contents (Figure 31.1). The placeholder file is small, so it can be committed to version control, and if the data file it refers to changes, that change will be visible in the version control history.
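For example, a Git LFS pointer file is just a few lines of text recording the hash and size of the real file (the hash and size below are made up):

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345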
The data file is stored on a server or in the cloud and downloaded on demand. Since the data file might be shared between several projects, it is usually not downloaded into any particular project. Instead, it is stored in a cache that all projects have access to. If the cache gets too large, files in it can be deleted and re-downloaded later.
This architecture has four parts: the permanent archive of data files, an index of all the files it contains, the cache of locally-available files, and the placeholder files that are actually saved in version control. There are several ways to implement each of these, so our implementation will give us a chance to explore some ideas in object-oriented programming.
A File Index
Let’s start by creating the index of archived files. We need to record each file’s unique identifier and the time it was added to the archive; we could define a class for this, but since instances of that class are just passive data records, we use a named tuple instead:
from collections import namedtuple
CacheEntry = namedtuple("CacheEntry", ["identifier", "timestamp"])
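Instances of a named tuple behave like ordinary tuples, but their fields can also be read by name; a quick sketch (a real entry stores a datetime rather than a string):

entry = CacheEntry("abcd1234", "2022-12-12:09:30:00")
print(entry.identifier)        # abcd1234
identifier, timestamp = entry  # still unpacks like a plain tuple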
Next, we create an abstract class that defines the operations an index can do but doesn't actually implement them all. More precisely, it implements only the operations that can be defined in terms of lower-level methods that derived classes will provide:
import datetime
from abc import ABC, abstractmethod


# Not shown in the original excerpt: the exception type used throughout.
class CacheException(Exception):
    """Signal problems with the cache or its index."""


class IndexBase(ABC):
    TIME_FORMAT = "%Y-%m-%d:%H:%M:%S"

    def __init__(self, index_dir=None):
        self.set_index_dir(index_dir)

    def get_index_dir(self):
        return self.index_dir

    def set_index_dir(self, index_dir):
        self.index_dir = index_dir
        self._initialize_index()

    def has(self, identifier):
        index = self.load()
        return any(entry.identifier == identifier for entry in index)

    def known(self):
        return {entry.identifier for entry in self.load()}

    def add(self, identifier):
        index = self.load()
        entry = CacheEntry(identifier, current_time())
        index.append(entry)
        self.save(index)
The methods of IndexBase can:
- get and set the directory where the index file is stored;
- check if a particular file is known (using its identifier, not its name);
- get a list of all known files (again, by identifier rather than name); and
- add an identifier to the index with a timestamp.
These methods rely on three abstract methods to initialize the index if it doesn’t already exist, load the index, and save the index. Their definitions are:
@abstractmethod
def load(self):
    """Load entire index."""

@abstractmethod
def save(self, index):
    """Save entire index."""

@abstractmethod
def _initialize_index(self):
    """Initialize index file if necessary."""
Finally, we define a function current_time that returns a time we can use in an index record. Doing this gives us something that will be easy to mock for testing:
def current_time():
    return datetime.datetime.now()
Naming Things
The index stores files’ unique IDs rather than their names because the latter are unreliable. A file may be moved or renamed several times within a project, or may have different names in different projects; our design therefore relies on something intrinsic to the file rather than something under the user’s control.
The price for this decision is that files' "names" are meaningless. Whether we use sequence numbers or hashes, remembering that health_records.csv is file 177234 increases the cognitive load on our users.
By writing IndexBase, we have turned a large, amorphous design problem into a smaller and more focused one (Figure 31.2). For example, if we decide to store the index as a JSON file or in a SQLite database, we should only need to provide the methods to handle that file format; we should not need to re-write or re-test any of the other methods. This rule gives us a way to evaluate our designs: the fewer methods we have to override, the better the original division of labor was.
To try it out, let's implement a concrete index class that stores data in a CSV file. Its load method checks that the cache directory exists and that it contains an index file, then reads that file and turns its contents into CacheEntry records:
import csv
from datetime import datetime
from pathlib import Path

from index_base import CacheEntry, CacheException, IndexBase


class IndexCSV(IndexBase):
    def load(self):
        if not self.index_dir:
            raise CacheException("Cache directory not set in index")
        index_path = self._make_index_path()
        if not index_path.exists():
            raise CacheException(f"Index file {index_path} not found")
        with open(index_path, "r") as stream:
            reader = csv.reader(stream)
            return [
                CacheEntry(r[0], datetime.strptime(r[1], self.TIME_FORMAT))
                for r in reader
            ]
Similarly, save checks that the cache directory exists and then writes each record to the index file. This method automatically creates the index file if it doesn't already exist:
def save(self, index):
    if not self.index_dir:
        raise CacheException("Cache directory not set in index")
    index_path = self._make_index_path()
    with open(index_path, "w") as stream:
        writer = csv.writer(stream)
        for entry in index:
            when = entry.timestamp.strftime(self.TIME_FORMAT)
            writer.writerow((entry.identifier, when))
Finally, we need to write _initialize_index, which is called by the set_index_dir method in the base class rather than being invoked directly by the user. While we're doing that, we'll define the name for the index file and write a helper method that produces its full path:
INDEX_FILE = "index.csv"

def _initialize_index(self):
    self._make_index_path().touch()

def _make_index_path(self):
    if not self.index_dir:
        raise CacheException("Cache directory not set in index")
    return Path(self.index_dir, self.INDEX_FILE)
Time to do some testing. As in Chapter 10, we will use pyfakefs instead of creating files on disk, and pytest's fixture decorator to set things up:
from datetime import datetime
from pathlib import Path
from unittest.mock import patch
import pytest
from index_base import CacheEntry
from index_csv import IndexCSV  # assumed module name for the class above

CACHE_DIR = Path("/cache")
INDEX_FILE = Path(CACHE_DIR, IndexCSV.INDEX_FILE)
LOCAL_DIR = Path(".")

@pytest.fixture
def disk(fs):
    fs.create_dir(CACHE_DIR)
If we create a new cache, it should be empty:
def test_csv_loads_initially(disk):
    index = IndexCSV(CACHE_DIR)
    assert index.load() == []
    assert index.known() == set()
If we add an entry, it should still be there when we reload the index:
def test_csv_saves_changes(disk):
    right_now = datetime(2022, 12, 12)
    index = IndexCSV(CACHE_DIR)
    with patch("index_base.current_time", return_value=right_now):
        index.add("abcd1234")
    assert index.load() == [CacheEntry("abcd1234", right_now)]
    assert index.known() == {"abcd1234"}
And if we check whether or not something is in the index, the answer should be “yes” for things we’ve added and “no” for things we haven’t:
def test_csv_has_entry(disk):
    right_now = datetime(2022, 12, 12)
    index = IndexCSV(CACHE_DIR)
    with patch("index_base.current_time", return_value=right_now):
        index.add("abcd1234")
    assert index.has("abcd1234")
    assert not index.has("dcba4321")
A Local Cache
Just as IndexBase defined the behavior every cache index must have, CacheBase defines the required behavior of caches themselves. As the listing below shows, it is less than three dozen lines long:
class CacheBase(ABC):
    CACHE_SUFFIX = "cache"

    def __init__(self, index, cache_dir):
        self.index = index
        self.cache_dir = cache_dir

    def add(self, local_path):
        identifier = self._make_identifier(local_path)
        self.index.add(identifier)
        self._add(identifier, local_path)
        return identifier

    def get_cache_path(self, identifier):
        if not self.index.has(identifier):
            raise CacheException(f"Unknown file identifier {identifier}")
        return self._make_cache_path(identifier)

    def has(self, identifier):
        return self.index.has(identifier)

    def known(self):
        return self.index.known()

    def _make_identifier(self, local_path):
        with open(local_path, "rb") as reader:
            return hash_stream(reader)

    def _make_cache_path(self, identifier):
        return Path(self.cache_dir, f"{identifier}.{self.CACHE_SUFFIX}")

    @abstractmethod
    def _add(self, identifier, local_path):
        """Add a file with a given identifier from a given local path."""
CacheBase encapsulates several design decisions. Cached files are named identifier.cache and are all stored in a single directory. This scheme works well if the cache only ever holds a few thousand files, but if that number might grow into the tens or hundreds of thousands, we might want to divide the cached files between several sub-directories (which is what most browsers do).
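For example, a sharded variant of _make_cache_path (a sketch, not part of the design above) could spread files across sub-directories keyed by the first two characters of the identifier, much as Git stores objects:

def _make_cache_path(self, identifier):
    # Hypothetical sharding scheme: store abcd1234.cache in <cache_dir>/ab/.
    sub_dir = Path(self.cache_dir, identifier[:2])
    sub_dir.mkdir(parents=True, exist_ok=True)
    return Path(sub_dir, f"{identifier}.{self.CACHE_SUFFIX}")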
All of CacheBase's operations are implemented in terms of an index derived from IndexBase, a helper method that makes a file identifier by hashing the file's contents, another helper method that constructs the path to a cached file, and a single abstract method called _add that actually adds a file to the cache. CacheBase doesn't rely on how the index is implemented, which means we can defer making that decision until we absolutely have to. Equally, encapsulating the specifics of file storage in a single method allows us to do most of the implementation before figuring that out.
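The excerpt doesn't show hash_stream; a minimal version, assuming SHA-256 is an acceptable choice of hash, might look like this:

import hashlib

def hash_stream(reader, block_size=65536):
    """Hash a binary stream block by block so large files fit in memory."""
    hasher = hashlib.sha256()
    while True:
        block = reader.read(block_size)
        if not block:
            break
        hasher.update(block)
    return hasher.hexdigest()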
With CacheBase in place, we can create a class called CacheFilesystem that copies files to the local cache but doesn't actually archive them anywhere:
import shutil

class CacheFilesystem(CacheBase):
    def _add(self, identifier, local_path):
        cache_path = self._make_cache_path(identifier)
        if not cache_path.exists():
            shutil.copyfile(local_path, cache_path)
A file archiving system that doesn’t actually archive files isn’t particularly useful in practice, but it enables us to write some useful tests. Developers often create a minimum testable class for this reason. Just as creating an abstract base class that implements operations in terms of a small number of methods helps us break a design problem into smaller, more manageable pieces, building a minimum testable class enables us to check some things right away so that we can focus later testing effort on implementation-specific details.
By now our testing should look familiar. We create a fixture for the filesystem as a whole and another for the cache:
CACHE_DIR = Path("/cache")

@pytest.fixture
def disk(fs):
    fs.create_dir(CACHE_DIR)
    return fs

@pytest.fixture
def cache():
    return CacheFilesystem(IndexCSV(CACHE_DIR), CACHE_DIR)
and then write a test to check that if we haven’t ever added files to the cache, no files are present:
def test_filesystem_no_files_before_add(disk, cache):
    assert cache.known() == set()
We can then start testing that (for example) if we add two files with different names, the cache contains two files:
def test_filesystem_two_files_present_after_add(disk, cache):
    names = "ab"
    for name in names:
        filename = f"{name}.txt"
        disk.create_file(filename, contents=name)
        cache.add(filename)
    assert len(cache.known()) == 2
A Limited Cache
The next step is to archive all the files remotely and only cache a few locally. We will simulate remote storage using a second directory on our own machine, and limit the number of files in the cache rather than their total size; we’ll look at a size-based cache in the exercises.
Figure It Out Later
As soon as we start storing files on remote servers, we need to figure out how to authenticate the user, i.e., how to establish that they are who they say they are. Again, building our system piece by piece lets us sort out other details now and worry about those ones later. We are taking a risk, though: it’s possible that we won’t be able to extend our design to handle those extra requirements, but will instead have to tear it down and start over. Experience is the only reliable guide; the real aim of lessons like this one is to pass on experience so that the next person doesn’t have to step on all the landmines themselves.
Our new class CacheLimited needs:
- an index to keep track of files;
- the path to the cache directory;
- the path to the archive directory that we’re using to simulate remote storage; and
- the maximum number of files the cache is allowed to contain.
Its constructor is therefore:
class CacheLimited(CacheFilesystem):
    def __init__(self, index, cache_dir, archive_dir, local_limit):
        super().__init__(index, cache_dir)
        self.archive_dir = archive_dir
        self.local_limit = local_limit
At this point we are going to do something that we said earlier we wouldn't need to do. CacheBase defines a method get_cache_path, and CacheFilesystem used that method without altering it. In contrast, our new CacheLimited class overrides it to ensure there's a local copy of a file whenever we want to read it. In doing so, get_cache_path ensures that the cache has enough space for that file. More specifically, if we add a file when the cache is full, CacheLimited will:
- delete a file from the cache;
- add the new file to the cache; and
- add it to archival storage.
def get_cache_path(self, identifier):
    cache_path = super().get_cache_path(identifier)
    if not cache_path.exists():
        self._ensure_cache_space()
        archive_path = self._make_archive_path(identifier)
        shutil.copyfile(archive_path, cache_path)
    return cache_path
The method that ensures the cache has space for a new file checks the cache’s size and removes a file if necessary:
def _ensure_cache_space(self):
    # Count only cached files, so that (e.g.) the index file is never evicted.
    cache_dir = Path(self.cache_dir)
    assert cache_dir.exists()
    cache_files = list(cache_dir.glob(f"*.{self.CACHE_SUFFIX}"))
    if len(cache_files) < self.local_limit:
        return
    # Remove a file (never mind which one) to make room.
    choice = cache_files.pop()
    assert len(cache_files) < self.local_limit
    Path(choice).unlink()
We also need to implement _add to handle the case of adding an entirely new file:
def _add(self, identifier, local_path):
    self._add_archive(identifier, local_path)
    self._ensure_cache_space()
    super()._add(identifier, local_path)
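CacheLimited also relies on two helpers that the excerpt doesn't show. Minimal versions consistent with how they are called might look like this:

def _make_archive_path(self, identifier):
    return Path(self.archive_dir, f"{identifier}.{self.CACHE_SUFFIX}")

def _add_archive(self, identifier, local_path):
    # Copy the file into (simulated) remote storage if it isn't already there.
    archive_path = self._make_archive_path(identifier)
    if not archive_path.exists():
        shutil.copyfile(local_path, archive_path)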
The tests for CacheLimited look like those we've written before:
def test_limited_single_file_present_after_add(disk, cache):
    disk.create_file("test.txt", contents="xyz")
    ident = cache.add("test.txt")
    assert cache.has(ident)
    assert cache.known() == {ident}
    cache_path = cache.get_cache_path(ident)
    assert str(cache_path.parent) == str(CACHE_DIR)
    archive_path = cache._make_archive_path(ident)
    assert Path(archive_path).exists()
One difference is that they check the archive directory as well as the cache directory. To do this, the tests may use a method like _make_archive_path that isn't intended for general consumption. Application programs should never do this, but it's generally regarded as acceptable for tests (not least because the alternative is to make that "private" method generally available).
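The cache fixture these tests rely on isn't shown in the excerpt; one plausible version, assuming a second simulated directory for the archive and a limit of two files, is:

ARCHIVE_DIR = Path("/archive")

@pytest.fixture
def disk(fs):
    fs.create_dir(CACHE_DIR)
    fs.create_dir(ARCHIVE_DIR)
    return fs

@pytest.fixture
def cache():
    return CacheLimited(IndexCSV(CACHE_DIR), CACHE_DIR, ARCHIVE_DIR, local_limit=2)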
Placeholders
The last thing we need for a fully-functional, repository-friendly file-caching system is a reliable way to create the placeholder files that take the place of actual data files in version control. Each placeholder holds a cache entry's identifier; if we do our job properly, users will never need to see or remember these.
Python provides an open function for opening files, so we will create a cache_open function to open cached files:
def cache_open(cache, filename):
    if not filename.endswith(cache.CACHE_SUFFIX):
        filename = f"{filename}.{cache.CACHE_SUFFIX}"
    filename = Path(filename)
    if not filename.exists():
        raise FileNotFoundError(f"No local cache file {filename}")
    with open(filename, "r") as reader:
        identifier = reader.read().strip()
    cache_path = cache.get_cache_path(identifier)
    return open(cache_path, "r")
If the actual file is called a.txt, this function looks for a.txt.cache, reads the file identifier from it, uses that to find the path to the cached file, and opens that file for reading. We don't allow users to open files for writing, since that would almost certainly put the file's contents out of step with its identifier. This restriction is similar to Python's requirement that dictionaries' keys be immutable; we will explore alternatives in the exercises.
cache_save is the complement to cache_open. Given the name of a local file that the user wants to cache, this function saves it in the cache and constructs a placeholder file:
def cache_save(cache, filename):
    identifier = cache.add(filename)
    identifier_file = Path(f"{filename}.{cache.CACHE_SUFFIX}")
    with open(identifier_file, "w") as writer:
        writer.write(identifier)
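Putting the pieces together, a round trip through the whole system looks something like this (a sketch that assumes the cache and archive directories already exist and that sample.txt is a real file):

index = IndexCSV("/cache")
cache = CacheLimited(index, "/cache", "/archive", local_limit=10)

cache_save(cache, "sample.txt")   # copies the file and writes sample.txt.cache
with cache_open(cache, "sample.txt") as reader:
    print(reader.read())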
These two functions work as designed, but neither should be used in production. The problem is that neither function works atomically: in both cases, if something goes wrong part-way through the function, the caching system can be left in an inconsistent state. While there’s no obvious way for these functions to fail mid-flight, there’s always the possibility of a power failure or of the user typing Control-C to interrupt the program at just the wrong spot. Tools like DVC do a lot of extra work to guarantee that this can’t happen.
Summary
Exercises
An alternative index
Create a new class IndexJSON that stores the index as a JSON file instead of as CSV. You should be able to use your new class with the existing cache classes without changing the latter.
Another alternative index
Create a new class IndexSQLite that stores the index in a SQLite database instead of as CSV. You should be able to use your new class with the existing cache classes without changing the latter.
Consolidation
Write a tool that finds all the placeholder files in a project and copies the files they refer to from the cache into the project, deleting placeholders as it does so.
Cache size
- Modify the cache so that it only stores files up to a specified total size.
- How should the cache decide which files to delete when a new file would put it over its size limit?
- What should the cache do if someone tries to add a file that is larger than the size limit?
Least recently used
- Modify the cache and index to keep track of when files are used as well as when they are created.
- Modify the cache cleanup code to delete the least recently used file when extra space is needed.
Compressing files
- Modify the system to compress files when they are archived. Use Python's zipfile module to handle compression.
- Modify the system again so that users can specify whether files should be compressed in both the archive and the cache, only compressed in the archive, or never compressed.
Reading cached files
- Modify cache_open so that files can be opened in binary mode.
- Modify it again so that the function can be used as a context manager in with statements.
Storing files remotely
- Create a small web server (Chapter 22) that accepts file upload and file download requests.
- Modify the cache so that it sends new files to the server and downloads files from the server on demand.
Commenting
- Modify the system so that users can add comments about files in the local placeholder files.
- Suppose two projects A and B are stored in separate version control repositories but use the same data files. If comments are stored in placeholder files, then comments saved in A won't be visible in B and vice versa. How could you modify your solution to part 1 to enable inter-project sharing of comments?
Appending
Modify the system so that users can append data to existing files:
- Each placeholder file stores the IDs of one or more cached files and the number of bytes in each.
- When a user wants to read from a file, they may specify a byte offset from the start of the (logical) file where reading is to start.
- When a user opens a file for appending, another chunk is created in the cache and writes are appended to it. When they close the file, an entry is added to the placeholder file with information about the new chunk.