A Code Linter

A linter checks that a program conforms to a set of style and usage rules.
Linters typically use the Visitor design pattern to find nodes of interest in an abstract syntax tree.
Programs can modify a program’s AST and then unparse it to create modified versions of the original program.
Dynamic code modification is very powerful, but the technique can produce insecure and unmaintainable code.

This book relies on about 1800 lines of Python to turn Markdown into HTML, fill in cross-references, and so on. To keep that code readable, we use black, flake8, and isort to check that lines aren’t too long, that classes and functions have consistent names, that modules are imported in a consistent order, and dozens of other things.

Checking tools are often called linters because an early tool like this that found fluff in C programs was called lint. Many projects insist that code pass linting checks before being committed to version control. To show how linters work, this chapter builds a trio of tools that find duplicate keys in dictionaries, look for unused variables, and create a table showing which classes in a hierarchy define which methods.

Machinery

Chapter 11 represented HTML as a DOM tree. We can also represent the structure of a program as an abstract syntax tree (AST) whose nodes represent functions, statements, variables, array indexing operations, and so on.

Python’s ast module will parse Python source code and produce an AST for us. For example, Figure 13.1 shows key parts of the AST for the short program shown below:

def double(x):
    return 2 * x

result = double(3)
print(result)

Simple AST — Figure 13.1: The abstract syntax tree for a simple Python program.

We said “key parts of the AST” because the complete structure contains many details that we haven’t bothered to draw. To see them, let’s use ast.parse to turn our example code into an AST and ast.dump to display it:

import ast
import sys

with open(sys.argv[1], "r") as reader:
    source = reader.read()

tree = ast.parse(source)
print(ast.dump(tree, indent=4))

python dump_ast.py simple.py

Module(
    body=[
        FunctionDef(
            name='double',
            args=arguments(
                posonlyargs=[],
                args=[
                    arg(arg='x')],
                kwonlyargs=[],
                kw_defaults=[],
                defaults=[]),
            body=[
                Return(
                    value=BinOp(
                        left=Constant(value=2),
                        op=Mult(),
                        right=Name(id='x', ctx=Load())))],
            decorator_list=[]),
        Assign(
            targets=[
                Name(id='result', ctx=Store())],
            value=Call(
                func=Name(id='double', ctx=Load()),
                args=[
                    Constant(value=3)],
                keywords=[])),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='result', ctx=Load())],
                keywords=[]))],
    type_ignores=[])

The node representing the definition of the function double is a FunctionDef node with a name and an arguments sub-node that stores information about the function’s arguments; other nodes that we have left out represent its return value, the call to double, the assignment to result, and so on.

If we want a list of all the functions defined in this module, we can walk through this tree to find all the FunctionDef nodes and record their name properties. Since each node’s structure is a little different, we would have to write one function for each type of node that knew which fields of that node were worth exploring.

Luckily for us the ast module has tools to do this for us. The class ast.NodeVisitor uses the now-familiar Visitor design pattern to recurse through a structure like the one in Figure 13.1. Each time the visitor reaches a node of type Thing, it looks for a method called visit_Thing; for example, when it reaches a FunctionDef node it looks for visit_FunctionDef. If that method has been defined, NodeVisitor calls it with the node as an argument. The class CollectNames uses this machinery to create a list of the function and variable names defined in a program:

class CollectNames(ast.NodeVisitor):
    def __init__(self):
        super().__init__()
        self.names = {}

    def visit_Assign(self, node):
        for var in node.targets:
            self.add(var, var.id)
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        self.add(node, node.name)
        self.generic_visit(node)

    def add(self, node, name):
        loc = (node.lineno, node.col_offset)
        self.names[name] = self.names.get(name, set())
        self.names[name].add(loc)

    def position(self, node):
        return ({node.lineno}, {node.col_offset})

A few things worth noting about this class are:

The constructor of CollectNames invokes the constructor of NodeVisitor using super().__init__() before doing anything else.
The methods visit_Assign and visit_FunctionDef must call self.generic_visit(node) explicitly to recurse down through their children. By requiring this to be explicit, NodeVisitor gives programmers control on whether and when recursion takes place.
The method position relies on the fact that every node in the AST keeps track of where in the source code it came from.

To use this class, we read the source of the program that we want to analyze, parse it, and then call the visit method of our class to trigger recursion:

with open(sys.argv[1], "r") as reader:
    source = reader.read()
tree = ast.parse(source)
collector = CollectNames()
collector.visit(tree)
print(collector.names)

python walk_ast.py simple.py

{'double': {(1, 0)}, 'result': {(4, 0)}}

With a little more work we could record class names as well, and then check that (for example) class names use CamelCase, while function and variable names use pothole_case. We’ll tackle this in the exercises.

Finding Duplicate Keys

Many programs store their configuration in dictionaries. As those dictionaries grow larger, it’s easy for programmers to redefine values by accident. For example, the dictionary in this short piece of code has two entries for the key "third":

has_duplicates = {
    "third": 3,
    "fourth": 4,
    "fourth": 5,
    "third": 6
}
print(has_duplicates)

Python could treat this as an error, keep the first entry, keep the last entry, or concatenate the entries somehow. As the output below shows, it chooses the third option:

{'third': 6, 'fourth': 5}

We can build a linter that finds dictionaries like has_duplicates with just a few lines of code and the Counter class from Python’s collections module (which implements a specialized dictionary that counts how many times a key has been seen). We define a visit_Dict method for NodeVisitor that adds each constant key to the counter, then look for keys that have been seen more than once:

class FindDuplicateKeys(ast.NodeVisitor):
    def visit_Dict(self, node):
        seen = Counter()
        for key in node.keys:
            if isinstance(key, ast.Constant):
                seen[key.value] += 1
        problems = {k for (k, v) in seen.items() if v > 1}
        self.report(node, problems)
        self.generic_visit(node)

    def report(self, node, problems):
        if problems:
            msg = ", ".join(p for p in problems)
            print(f"duplicate key(s) {{{msg}}} at {node.lineno}")

When we parse has_duplicate_keys.py and pass the AST to FindDuplicateKeys, we get:

duplicate key(s) {fourth, third} at 1

As Far as We Can Go

FindDuplicateKeys only considers constant keys, which means it won’t find duplicate keys that are created on the fly like this:

def label():
    return "label"

actually_has_duplicate_keys = {
    "label": 1,
    "la" + "bel": 2,
    label(): 3,
    "".join(["l", "a", "b", "e", "l"]): 4,
}

We could try adding more code to handle this, but there are so many different ways to generate keys on the fly that our linter couldn’t possibly catch them all. The possibility of false negatives doesn’t mean that linting is useless, though: every problem that linting catches gives programmers more time to check for things that linters can’t find.

Finding Unused Variables

Finding unused variables—ones that are assigned values but never used—is more challenging than our previous examples. The problem is scope: a variable defined in a function or method might have the same name as one defined elsewhere, but they are different variables.

Let’s start by defining a class that handles variables in modules and functions. Since functions can be defined inside modules and other functions, the constructor for our class creates a list that we will use as a stack to keep track of what scopes we’re currently in:

class FindUnusedVariables(ast.NodeVisitor):
    def __init__(self):
        super().__init__()
        self.stack = []

    def visit_Module(self, node):
        self.search("global", node)

    def visit_FunctionDef(self, node):
        self.search(node.name, node)

We could just use a list of three values to record information for each scope, but using namedtuple (which also comes from Python’s collections module) tells readers explicitly what each scope consists of:

Scope = namedtuple("Scope", ["name", "load", "store"])

Each time we encounter a new scope we push a new Scope triple onto the stack with a name, a set to hold the variables that are used in the scope, and another set to hold the variables that are defined in the scope. We then call NodeVisitor.generic_visitor to trigger recursion, pop the record we just pushed off the stack, and report any problems:

    def search(self, name, node):
        self.stack.append(Scope(name, set(), set()))
        self.generic_visit(node)
        scope = self.stack.pop()
        self.check(scope)

    def check(self, scope):
        unused = scope.store - scope.load
        if unused:
            names = ", ".join(sorted(unused))
            print(f"unused in {scope.name}: {names}")

The last part of the puzzle is visit_Name. If the variable’s value is being read, the node will have a property .ctx (short for “context”) of type ast.Load. If the variable is being written to, the node’s .ctx property will be an instance of ast.Store. Checking this property allows us to put the name in the right set in the scope that’s at the top of the stack:

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Load):
            self.stack[-1].load.add(node.id)
        elif isinstance(node.ctx, ast.Store):
            self.stack[-1].store.add(node.id)
        else:
            assert False, "Unknown context"
        self.generic_visit(node)

Once again, we can run this by reading the source of a program, converting it to an AST, constructing an instance of FindUnusedVariables, and running its visit method:

with open(sys.argv[1], "r") as reader:
    source = reader.read()
tree = ast.parse(source)
finder = FindUnusedVariables()
finder.visit(tree)

To test our code, let’s create a program that has some unused variables:

used = 3
distractor = 2
not_used = used + distractor


def no_unused(param):
    result = 2 * param
    return result


def has_unused(param):
    used = 3 * param
    not_used = 2 * param
    distractor = "distraction"
    return used

When we run our linter we get:

unused in has_unused: distractor, not_used
unused in global: not_used

Summary

Figure 13.2 summarizes the ideas introduced in this chapter; please see Appendix B for some more related material.

Concept map for code manipulation — Figure 13.2: Concepts for code manipulation.

Exercises

Finding Unused Parameters

Modify the code that finds unused variables to report unused function parameters as well.

Finding Redundant Assignments

Write a linter that looks for redundant assignments to variables, i.e., assignments that are immediately overwritten:

x = 1  # redundant
x = 2

(Redundant assignments are a common result of copying and pasting.)

Checking Names

Write a linter that checks that class names are written in CamelCase but function and variable names are in pothole_case.

Missing Documentation

Write a linter that complains about modules, classes, methods, and functions that don’t have docstrings.

Missing Tests

Write a linter that takes two files as input: one that defines one or more functions and another that defines one or more tests of those functions. The linter looks through the tests to see what functions are being called, then reports any functions from the first file that it hasn’t seen.

Chaining Methods

Modify the code that injects methods into NodeVisitor so that any previously injected methods are also called.
Modify the methods again so that each one signals whether or not it has handled recursion (either directly or indirectly).

Sorting Imports

isort checks that the imports in a file are sorted correctly: modules from Python’s standard library come first (in alphabetical order), then installed modules (also in alphabetical order) and finally local imports (ditto). Write a linter that reports violations of these rules. How did you distinguish between the three cases?