A Package Manager

  • Software packages often have multiple versions, which are usually identified by multi-part semantic version numbers.
  • A package manager must find a mutually-compatible set of dependencies in order to install a package.
  • Finding a compatible set of packages is equivalent to searching a multi-dimensional space.
  • The work required to find a compatible set of packages can grow exponentially with the number of packages.
  • Eliminating partially-formed combinations of packages can reduce the work required to find a compatible set.
  • An automated theorem prover can determine if a set of logical propositions can be made consistent with each other.
  • Most package managers use some kind of theorem prover to find compatible sets of packages to install.

Terms defined: accumulator, backward-compatible, Boolean value, combinatorial explosion, cross product, model, patch, Recursive Enumeration pattern, scoring function, search space, semantic versioning

Inspired by the Comprehensive TeX Archive Network (CTAN), most languages have an online archive from which people can download packages, such as Python’s PyPI. Each package typically has a name, one or more versions, and a list of dependencies (which are also versioned). In order to install a package, we need to figure out exactly what versions of its dependencies to install: if A and B require different versions of C, we might not be able to use A and B together.

This chapter explores how to find a workable installation or prove that there isn’t one. It is based in part on this tutorial by Maël Nison and on Andreas Zeller’s lecture on academic prototyping; interested readers might also enjoy Michael Reim’s history of Unix packaging.

Semantic Versioning

Most software projects use semantic versioning for software releases. Each version is three integers X.Y.Z, where X is the major version, Y is the minor version, and Z is the patch.

A package’s authors increment its major version number when a change to the package breaks backward compatibility, i.e., if code built for the old version will fail or behave unpredictably with the new one. The minor version number is incremented when changes won’t break any existing code, and the patch number is changed for bug fixes that don’t add any new features.

The notation for specifying ranges of versions looks like arithmetic: >=1.2.3 means “any version from 1.2.3 onward”, <4 means “any version before 4.anything”, and 1.0-3.1 means “any version in the specified range (including patches)”. Note that version 2.1 is greater than version 1.99: no matter how large a minor version number becomes, it never spills over into the major version number.

It isn’t hard to compare simple semantic version identifiers, but handling the whole standard is almost as tricky as handling dates and times correctly. Our examples therefore number versions with plain integers; we recommend the semantic-version package for working with the real thing.

Exhaustive Search

To avoid messing around with parsers, we store the manifest of available packages as JSON:

{
  "A": {
    "3": {"B": ["3", "2"], "C": ["2"]},
    "2": {"B": ["2"],      "C": ["2", "1"]},
    "1": {"B": ["1"]}
  },
  "B": {
    "3": {"C": ["2"]},
    "2": {"C": ["1"]},
    "1": {"C": ["1"]}
  },
  "C": {
    "2": [],
    "1": []
  }
}

The keys in the main dictionary identify packages (which we’ve called A, B, and C for simplicity). Each package has a dictionary whose keys are version numbers, and each version has a dictionary showing which versions of which other packages are dependencies (Figure 20.1). It’s a complex data structure, but all of the detail is necessary.

Manifest structure
Figure 20.1: Structure of version dependency manifest.

Comments

We have been advising you since Chapter 5 not to design your own data format, but if you do, please include a single standard way for people to add comments. YAML has this, but JSON and CSV don’t.

Imagine that each package we need is an axis on a multi-dimensional grid (Figure 20.2), so each point on the grid is a possible combination of package versions. We can exclude regions of this grid using the constraints on the package versions; the points that are left represent legal combinations.

Allowable versions
Figure 20.2: Finding allowable combinations of package versions.

How much work is it to check all of these possibilities? Our example has \( 3×3×2=18 \) combinations. If we were to add another package to the mix with two versions, the search space would double; add another, and it would double again, which means that if each package has approximately \( c \) version and there are \( N \) packages, the work grows as \( O(c^N) \). This exponential behavior is called combinatorial explosion, and it makes brute-force solutions impractical even for small problems. We will implement it as a starting point (and to give us something to test more complicated solutions against), but then we will need to find a more efficient approach.

Reproducibility

There may not be a strong reason to prefer one mutually-compatible set of packages over another, but a package manager should resolve the ambiguity the same way every time. It might not be what everyone wants, but at least they will be unhappy for the same reasons everywhere. This is why pip list (and similar commands for other package managers) produce a listing of the exact versions of packages that have been installed: a spec written by a developer that lists allowed ranges of versions specifies what we want, while the listing created by the package manager specifies exactly what we got. If we want to reproduce someone else’s setup for debugging purposes, we should install what is described in the latter file.

Our brute-force program generates all possible combinations of package versions, then eliminates ones that aren’t compatible with the manifest. Its main body is just those steps in order with a few print statements to show the results:

def main():
    manifest = json.load(sys.stdin)
    possible = make_possibilities(manifest)
    print(f"{len(possible)} possibilities")
    allowed = [p for p in possible if compatible(manifest, p)]
    print(f"{len(allowed)} allowed")
    for a in allowed:
        print(a)

To generate the possibilities, we create a list of the available versions of each package, then use Python’s itertools module to generate the cross product that contains all possible combinations of items (Figure 20.3):

def make_possibilities(manifest):
    available = []
    for package, versions in manifest.items():
        available.append([(package, v) for v in versions])
    return list(itertools.product(*available))
Generating a cross product
Figure 20.3: Generating all possible combinations of items.

To check a candidate against the manifest, we compare every entry X against every other entry Y:

  1. If X and Y are the same package, we keep looking. We need this rule because we’re comparing every entry against every entry, which means we’re comparing package versions to themselves. We could avoid this redundant check by writing a slightly smarter loop, but there’s no point optimizing a horribly inefficient algorithm.

  2. If package X’s requirements say nothing about package Y, we keep searching. This rule handles the case of X not caring about Y, but it’s also the reason we need to compare all against all, since Y might care about X.

  3. Finally, if X does depend on Y, but this particular version of X doesn’t list this particular version of Y as a dependency, we can rule out this combination.

  4. If we haven’t ruled out a candidate after doing all these checks, we add it to the list of allowed configurations.

Sure enough, these rules find 3 valid combinations among our 18 possibilities:

def compatible(manifest, combination):
    for package_i, version_i in combination:
        lookup_i = manifest[package_i][version_i]
        for package_j, version_j in combination:
            if package_i == package_j:
                continue
            if package_j not in lookup_i:
                continue
            if version_j not in lookup_i[package_j]:
                return False
    return True
18 possibilities
3 allowed
(('A', '3'), ('B', '3'), ('C', '2'))
(('A', '2'), ('B', '2'), ('C', '1'))
(('A', '1'), ('B', '1'), ('C', '1'))

Generating Possibilities Manually

Our brute-force code uses itertools.product to generate all possible combinations of several lists of items. To see how it works, let’s rewrite make_possibilities to use a function of our own:

def make_possibilities(manifest):
    available = []
    for package, versions in manifest.items():
        available.append([(package, v) for v in versions])

    accum = []
    _make_possible(available, [], accum)
    return accum

The first half creates the same list of lists as before, where each sub-list is the available versions of a single package. It then creates an empty accumulator to collect all the combinations and calls a recursive function called _make_possible to fill it in.

Each call to _make_possible handles one package’s worth of work (Figure 20.4). If the package is X, the function loops over the available versions of X, adds that version to the combination in progress, and calls itself with the remaining lists of versions. If there aren’t any more lists to loop over, the recursive calls must have included exactly one version of each package, so the combination is appended to the accumulator.

def _make_possible(remaining, current, accum):
    if not remaining:
        accum.append(current)
    else:
        head, tail = remaining[0], remaining[1:]
        for h in head:
            _make_possible(tail, current + [h], accum)
18 possibilities
3 allowed
[('A', '3'), ('B', '3'), ('C', '2')]
[('A', '2'), ('B', '2'), ('C', '1')]
[('A', '1'), ('B', '1'), ('C', '1')]
Generating a cross product recursively
Figure 20.4: Generating all possible combinations of items recursively.

_make_possible uses recursion instead of nested loops because we don’t know how many loops to write. If we knew the manifest only contained three packages, we would write a triply-nested loop to generate combinations, but if there were four, we would need a quadruply-nested loop, and so on. This Recursive Enumeration design pattern uses one recursive function call per loop so that we automatically get exactly as many loops as we need.

Incremental Search

Generating an exponentiality of combinations and then throwing most of them away is inefficient. Instead, we can modify the recursive generator to stop if a partially-generated combination of packages isn’t legal. Combining generation and checking made the code more complicated, but as we’ll see, it leads to some significant improvements.

The main function for our modified program is similar to its predecessor. After loading the manifest, we generate a list of all package names. Unlike our earlier code, the entries in this list don’t include versions because we’re going to be checking those as we go:

def main():
    manifest = json.load(sys.stdin)
    packages = list(manifest.keys())
    if len(sys.argv) > 1:
        packages.reverse()

    accum = []
    count = find(manifest, packages, accum, [], 0)

    print(f"count {count}")
    for a in accum:
        print(a)

Notice that we reverse the list of packages before starting our search if the user provides an extra command-line argument. We’ll use this to see how ordering affects efficiency.

Our find function now has five parameters:

  1. The manifest that tells us what’s compatible with what.

  2. The names of the packages we haven’t considered yet.

  3. An accumulator to hold all the valid combinations we’ve found so far.

  4. The partially-completed combination we’re going to extend next.

  5. A count of the number of combinations we’ve considered so far, which we will use as a measure of efficiency.

def find(manifest, remaining, accum, current, count):
    count += 1
    if not remaining:
        accum.append(current)
    else:
        head, tail = remaining[0], remaining[1:]
        for version in manifest[head]:
            candidate = current + [(head, version)]
            if compatible(manifest, candidate):
                count = find(
                    manifest, tail, accum, candidate, count
                )
    return count

The algorithm combines the generation and checking we’ve already written:

  1. If there are no packages left to consider—i.e., if remaining is an empty list—then what we’ve built so far in current must be valid, so we append it to accumulator.

  2. Otherwise, we put the next package to consider in head and all the remaining packages in tail. We then check each version of the head package in turn. If adding it to the current collection of packages wouldn’t cause a problem, we continue searching with that version in place.

How much work does incremental checking save us? Using the same test case as before, we only create 11 candidates instead of 18, so we’ve reduced our search by about a third:

python incremental.py < triple.json
count 11
[('A', '3'), ('B', '3'), ('C', '2')]
[('A', '2'), ('B', '2'), ('C', '1')]
[('A', '1'), ('B', '1'), ('C', '1')]

If we reverse the order in which we search, though, we only generate half as many candidates as before:

python incremental.py reversed < triple.json
count 9
[('C', '2'), ('B', '3'), ('A', '3')]
[('C', '1'), ('B', '2'), ('A', '2')]
[('C', '1'), ('B', '1'), ('A', '1')]

Using a Theorem Prover

Cutting the amount of work we have to do is good: can we do better? The answer is yes, but the algorithms involved are complicated and the jargon almost impenetrable. To give you a taste of how they work, we will solve our example problem using the Z3 theorem prover.

Installing packages and proving theorems may not seem to have a lot to do with each other, but an automated theorem prover’s purpose is to determine how to make a set of logical propositions consistent with each other, or to prove that doing so is impossible. If we frame our problem as, “Is there a choice of package versions that satisfies all the inter-package dependencies at once?”, then a theorem prover is exactly what we need.

To start, let’s import a few things from z3 and create three Boolean variables:

from z3 import Bool

A = Bool("A")
B = Bool("B")
C = Bool("C")

Our three variables don’t have values yet—they’re not either true or false. Instead, each one represents all the possible states a Boolean could be in. If we had asked z3 to create one of its special integers, it would have given us something that initially encompassed all possible integer values.

Instead of assigning values to A, B, and C, we specify constraints on them, then ask z3 whether it’s possible to find a set of values, or model, that satisfies all those constraints at once. For example, we can ask whether it’s possible for A to equal B and B to equal C at the same time. The answer is “yes”, and the solution the solver finds is to make them all False:

solver = Solver()
solver.add(A == B)
solver.add(B == C)
report("A == B & B == C", solver.check())
A == B & B == C: sat
A False
B False
C False

What if we say that A and B must be equal, but B and C must be unequal? In this case, the solver finds a solution in which A and B are True but C is False:

solver = Solver()
solver.add(A == B)
solver.add(B != C)
report("A == B & B != C", solver.check())
A == B & B != C: sat
A True
B True
C False

Finally, what if we require A to equal B and B to equal C but A and C to be unequal? No assignment of values to the three variables can satisfy all three constraints at once, and the solver duly tells us that:

solver = Solver()
solver.add(A == B)
solver.add(B == C)
solver.add(A != C)
report("A == B & B == C & B != C", solver.check())
A == B & B == C & B != C: unsat

Returning to package management, we can represent the versions from our running example like this:

A1 = Bool("A.1")
A2 = Bool("A.2")
A3 = Bool("A.3")

B1 = Bool("B.1")
B2 = Bool("B.2")
B3 = Bool("B.3")

C1 = Bool("C.1")
C2 = Bool("C.2")

We then tell the solver that we want one of the available versions of package A:

solver = Solver()
solver.add(Or(A1, A2, A3))

and that the three versions of package A are mutually exclusive:

solver.add(Implies(A1, Not(Or(A2, A3))))
solver.add(Implies(A2, Not(Or(A1, A3))))
solver.add(Implies(A3, Not(Or(A1, A2))))

We need equivalent statements for packages B and C; we’ll explore in the exercises how to generate all of these from a package manifest.

Finally, we add the inter-package dependencies and search for a result:

solver.add(Implies(A3, And(Or(B3, B2), C2)))
solver.add(Implies(A2, And(B2, Or(C2, C1))))
solver.add(Implies(A1, B1))
solver.add(Implies(B3, C2))
solver.add(Implies(B2, C1))
solver.add(Implies(B1, C1))

print("result", solver.check(), solver.model())
result sat [B.3 = True,
 A.1 = False,
 C.2 = True,
 C.1 = False,
 B.2 = False,
 A.3 = True,
 A.2 = False,
 B.1 = False]

The output tells us that the combination of A.3, B.3, and C.2 will satisfy our constraints.

We saw earlier, though, that there are three solutions to our constraints. One way to find the others is to ask the solver to solve the problem again with the initial solution ruled out. We can repeat the process many times, adding “not the latest solution” to the constraints each time until the problem becomes unsolvable:

everything = [A1, A2, A3, B1, B2, B3, C1, C2]
while solver.check() == sat:
    model = solver.model()
    print([var for var in model.decls() if model[var]])
    settings = [var == model[var] for var in everything]
    cond = Not(And(*settings))
    solver.add(cond)
[B.3, C.2, A.3]
[C.1, B.2, A.2]
[A.1, C.1, B.1]

Summary

Figure 20.5 summarizes the key ideas introduced in this chapter. The most important thing to take away is that modern theorem provers can solve many more problems than most programmers realize. While formulating problems in ways that theorem provers understand can be challenging, solving those problems ourselves is usually much harder.

Concept map for package manager.
Figure 20.5: Concepts for package manager.

Exercises

Comparing Semantic Versions

Write a function that takes an array of semantic version specifiers and sorts them in ascending order. Remember that 2.1 is greater than 1.99.

Parsing Semantic Versions

Write a parser for a subset of the semantic versioning specification.

Using Scoring Functions

Many different combinations of package versions can be mutually compatible. One way to decide which actual combination to install is to create a scoring function that measures how good or bad a particular combination is. For example, a function could measure the “distance” between two versions as:

  • 100 times the difference in major version numbers;

  • 10 times the difference in minor version numbers if the major numbers agree; and

  • the difference in the patch numbers if both major and minor numbers agree.

Implement this function and use it to measure the total distance between the set of packages found by the solver and the set containing the most recent version of each package. Does it actually solve the original problem?

Regular Releases

Some packages release new versions regularly, e.g., Version 2023.1 is released on March 1 of 2023, Version 2023.2 is released on September 1 of that year, version 2024.1 is released on March 1 of the following year, and so on.

  1. How does this make package management easier?

  2. How does it make it more difficult?

Searching Least First

Rewrite the constraint solver so that it searches packages by looking at those with the fewest available versions first. Does this reduce the amount of work done for the small examples in this chapter? Does it reduce the amount of work done for larger examples?

Using Exclusions

  1. Modify the constraint solver so that it uses a list of package exclusions instead of a list of package requirements, i.e., its input tells it that version 1.2 of package Red can not work with versions 3.1 and 3.2 of package Green (which implies that Red 1.2 can work with any other versions of Green).

  2. Explain why package managers aren’t built this way.

Generating Constraints

Write a function that reads a JSON manifest describing package compatibilities and generates the constraints needed by the Z3 theorem prover.

Buildability

  1. Convert the build dependencies from one of the examples in Chapter 19 to a set of constraints for Z3 and use the solution to find a legal build order.

  2. Modify the constraints to introduce a circular dependency and check that the solver correctly determines that there is no legal build order.