A Virtual Machine

Every computer has a processor with a particular instruction set, some registers, and memory.
Instructions are just numbers but may be represented as assembly code.
Instructions may refer to registers, memory, both, or neither.
A processor usually executes instructions in order but may jump to another location based on whether a conditional is true or false.

Terms defined: Application Binary Interface, assembler, assembly code, bytecode, conditional jump, disassembler, instruction pointer, instruction set, label (of address in memory), op code, register (in hardware), virtual machine, word (of memory)

The interpreter in Chapter 7 relied on Python to do most of the actual work. The standard version of Python is implemented in C and relies on C’s operators to add numbers, index arrays, and so on, but C is compiled to instructions for a particular processor. Each operation in the little language of Chapter 7 is therefore expanded by several layers of software to become something that hardware can actually run. To show how that lower layer works, this chapter builds a simulator of a small computer. If you want to dive deeper into programming at this level, have a look at the game Human Resource Machine.

Architecture

Our virtual machine (VM) simulates a computer with three parts (Figure 25.1):

The instruction pointer (IP) holds the memory address of the next instruction to execute. It is automatically initialized to point at address 0, so that is where every program must start. This requirement is part of our VM’s Application Binary Interface (ABI).
Four registers named R0 to R3 that instructions can access directly. There are no memory-to-memory operations in our VM: everything happens in or through registers.
256 words of memory, each of which can store a single value. Both the program and its data live in this single block of memory; we chose the size 256 so that the address of each word will fit in a single byte.

Virtual machine architecture — Figure 25.1: Architecture of the virtual machine.

Our processor’s instruction set defines what it can do. Instructions are just numbers, but we will write them in a simple text format called assembly code that gives those number human-readable names.

The instructions for our VM are 3 bytes long. The op code fits in one byte, and each instruction may include zero, one, or two single-byte operands. (Instructions are sometimes called bytecode, since they’re packed into bytes, but so is everything else in a computer.)

Each operand is a register identifier, a constant, or an address, which is just a constant that identifies a location in memory. Since constants have to fit in one byte, this means that the largest number we can represent directly is 256. Table 25.1 uses the letters r and v to indicate instruction format, where r indicates a register identifier and v indicates a constant value.

Table 25.1: Virtual machine op codes.
Name	Code	Format	Action	Example	Equivalent
`hlt`	1	`--`	Halt program	`hlt`	`sys.exit(0)`
`ldc`	2	`rv`	Load constant	`ldc R0 99`	`R0 = 99`
`ldr`	3	`rr`	Load register	`ldr R0 R1`	`R0 = memory[R1]`
`cpy`	4	`rr`	Copy register	`cpy R0 R1`	`R0 = R1`
`str`	5	`rr`	Store register	`str R0 R1`	`memory[R1] = R0`
`add`	6	`rr`	Add	`add R0 R1`	`R0 = R0 + R1`
`sub`	7	`rr`	Subtract	`sub R0 R1`	`R0 = R0 - R1`
`beq`	8	`rv`	Branch if equal	`beq R0 99`	`if (R0==0) IP = 99`
`bne`	9	`rv`	Branch if not equal	`bne R0 99`	`if (R0!=0) IP = 99`
`prr`	10	`r-`	Print register	`prr R0`	`print(R0)`
`prm`	11	`r-`	Print memory	`prm R0`	`print(memory[R0])`

To start building our virtual machine, we put the VM’s details in a file that can be loaded by other modules:

NUM_REG = 4  # number of registers
RAM_LEN = 256  # number of words in RAM

OPS = {
    "hlt": {"code": 0x1, "fmt": "--"},  # Halt program
    "ldc": {"code": 0x2, "fmt": "rv"},  # Load value
    "ldr": {"code": 0x3, "fmt": "rr"},  # Load register
    "cpy": {"code": 0x4, "fmt": "rr"},  # Copy register
    "str": {"code": 0x5, "fmt": "rr"},  # Store register
    "add": {"code": 0x6, "fmt": "rr"},  # Add
    "sub": {"code": 0x7, "fmt": "rr"},  # Subtract
    "beq": {"code": 0x8, "fmt": "rv"},  # Branch if equal
    "bne": {"code": 0x9, "fmt": "rv"},  # Branch if not equal
    "prr": {"code": 0xA, "fmt": "r-"},  # Print register
    "prm": {"code": 0xB, "fmt": "r-"},  # Print memory
}

OP_MASK = 0xFF  # select a single byte
OP_SHIFT = 8  # shift up by one byte
OP_WIDTH = 6  # op width in characters when printing

There isn’t a name for this design pattern, but putting all the constants that define a system in one file instead of scattering them across multiple files makes them easier to find as well as ensuring consistency.

Execution

We start by defining a class with an instruction pointer, some registers, and some memory along with a prompt for output. A program is just an array of numbers representing instructions. To load a program into our VM, we copy those numbers into memory and reset the instruction pointer and registers:

class VirtualMachine:
    def __init__(self):
        self.initialize([])
        self.prompt = ">>"

    def initialize(self, program):
        assert len(program) <= RAM_LEN, "Program too long"
        self.ram = [
            program[i] if (i < len(program)) else 0
            for i in range(RAM_LEN)
        ]
        self.ip = 0
        self.reg = [0] * NUM_REG

Notice that the VM’s constructor calls initialize with an empty array (i.e., a program with no instructions) to do initial setup. If an object has a method to reset or reinitialize itself, having its constructor use that method is a way to avoid duplicating code.

To execute the next instruction, the VM gets the value in memory that the instruction pointer currently refers to and moves the instruction pointer on by one address. It then uses bitwise operations (Chapter 17) to extract the op code and operands from the instruction (Figure 25.2).

    def fetch(self):
        instruction = self.ram[self.ip]
        self.ip += 1
        op = instruction & OP_MASK
        instruction >>= OP_SHIFT
        arg0 = instruction & OP_MASK
        instruction >>= OP_SHIFT
        arg1 = instruction & OP_MASK
        return [op, arg0, arg1]

Unpacking instructions — Figure 25.2: Using bitwise operations to unpack instructions.

We always unpack two operands regardless of whether the instructions have them or not, since this is what most hardware implementations would do.

Processor Design

Some processors do have variable-length instructions, but they make the hardware more complicated and therefore slower. To decide whether these costs are worth paying, engineers rely on simulation and profiling (Chapter 15). Backward compatibility is also an issue: if earlier processors supported variable-length instructions, later ones must somehow do so as well in order to run old programs.

The next step is to add a run method to our VM that fetches instructions and executes them until told to stop:

    def run(self):
        running = True
        while running:
            op, arg0, arg1 = self.fetch()
            if op == OPS["hlt"]["code"]:
                running = False
            elif op == OPS["ldc"]["code"]:
                self.reg[arg0] = arg1
            elif op == OPS["ldr"]["code"]:
                self.reg[arg0] = self.ram[self.reg[arg1]]
            elif op == OPS["cpy"]["code"]:
                self.reg[arg0] = self.reg[arg1]

            else:
                assert False, f"Unknown op {op:06x}"

Let’s look more closely at three of these instructions. The first, str, stores the value of one register in the address held by another register:

            elif op == OPS["str"]["code"]:
                self.ram[self.reg[arg1]] = self.reg[arg0]

Adding the value in one register to the value in another register is simpler:

            elif op == OPS["add"]["code"]:
                self.reg[arg0] += self.reg[arg1]

as is jumping to a fixed address if the value in a register is zero. This conditional jump instruction is how we implement if:

            elif op == OPS["beq"]["code"]:
                if self.reg[arg0] == 0:
                    self.ip = arg1

Assembly Code

We could write out numerical op codes by hand just as early programmers did. However, it is much easier to use an assembler, which is just a small compiler for a language that very closely represents actual machine instructions.

Each command in our assembly languages matches an instruction in the VM. Here’s an assembly language program to print the value stored in R1 and then halt:

# Print initial contents of R1.
prr R1
hlt

Its numeric representation (in hexadecimal) is:

00010a
000001

One thing the assembly language has that the instruction set doesn’t is labels on addresses in memory. The label loop doesn’t take up any space; instead, it tells the assembler to give the address of the next instruction a name so that we can refer to @loop in jump instructions. For example, this program prints the numbers from 0 to 2 (Figure 25.3):

# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt

Counting from 0 to 2 — Figure 25.3: Flowchart of assembly language program to count up from 0 to 2.

Let’s trace this program’s execution (Figure 25.4):

R0 holds the current loop index.
R1 holds the loop’s upper bound (in this case 3).
The loop prints the value of R0 (one instruction).
The program adds 1 to R0. This takes two instructions because we can only add register-to-register.
It checks to see if we should loop again, which takes three instructions.
If the program doesn’t jump back, it halts.

Trace counting program — Figure 25.4: Tracing registers and memory values for a simple counting program.

The implementation of the assembler mirrors the simplicity of assembly language. The main method gets interesting lines, finds the addresses of labels, and turns each remaining line into an instruction:

class Assembler:
    def assemble(self, lines):
        lines = self._get_lines(lines)
        labels = self._find_labels(lines)
        instructions = [
            ln for ln in lines if not self._is_label(ln)
        ]
        compiled = [
            self._compile(instr, labels) for instr in instructions
        ]
        program = self._to_text(compiled)
        return program

To find labels, we go through the lines one by one and either save the label or increment the current address (because labels don’t take up space):

    def _find_labels(self, lines):
        result = {}
        loc = 0
        for ln in lines:
            if self._is_label(ln):
                label = ln[:-1].strip()
                assert label not in result, f"Duplicated {label}"
                result[label] = loc
            else:
                loc += 1
        return result

    def _is_label(self, line):
        return line.endswith(":")

To compile a single instruction we break the line into pieces, look up the format for the operands, and pack the values:

    def _compile(self, instruction, labels):
        tokens = instruction.split()
        op, args = tokens[0], tokens[1:]
        fmt, code = OPS[op]["fmt"], OPS[op]["code"]

        if fmt == "--":
            return self._combine(code)

        elif fmt == "r-":
            return self._combine(self._reg(args[0]), code)

        elif fmt == "rr":
            return self._combine(
                self._reg(args[1]), self._reg(args[0]), code
            )

        elif fmt == "rv":
            return self._combine(
                self._val(args[1], labels),
                self._reg(args[0]), code
            )

To convert a value, we either look up the label’s address (if the value starts with @) or convert the value to a number:

    def _val(self, token, labels):
        if token[0] != "@":
            return int(token)
        lbl = token[1:]
        assert lbl in labels, f"Unknown label '{token}'"
        return labels[lbl]

Combining op codes and operands into a single value is the reverse of the unpacking done by the virtual machine:

    def _combine(self, *args):
        assert len(args) > 0, "Cannot combine no arguments"
        result = 0
        for a in args:
            result <<= OP_SHIFT
            result |= a
        return result

As a test, this program counts up to 3:

# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt

>> 0
>> 1
>> 2
R000000 = 000003
R000001 = 000003
R000002 = 000000
R000003 = 000000
000000:   000002  030102  00000a  010202
000004:   020006  010204  000207  020209
000008:   000001  000000  000000  000000

Arrays

It’s tedious to write programs when each value needs a unique name. We can do a lot more once we have arrays, so let’s add those to our assembler. We don’t have to make any changes to the virtual machine, which doesn’t care if we think of a bunch of numbers as individuals or elements of an array, but we do need a way to create arrays and refer to them.

We will allocate storage for arrays at the end of the program by using .data on a line of its own to mark the start of the data section and then label: number to give a region a name and allocate some storage space (Figure 25.5).

Array storage allocation — Figure 25.5: Allocating storage for arrays in the virtual machine.

This enhancement only requires a few changes to the assembler. First, we need to split the lines into instructions and data allocations:

    DIVIDER = ".data"

    def assemble(self, lines):
        lines = self._get_lines(lines)
        to_compile, to_allocate = self._split(lines)

        labels = self._find_labels(lines)
        instructions = [ln for ln in to_compile if not self._is_label(ln)]

        base_of_data = len(instructions)
        self._add_allocations(base_of_data, labels, to_allocate)

        compiled = [self._compile(instr, labels) for instr in instructions]
        program = self._to_text(compiled)
        return program

    def _split(self, lines):
        try:
            split = lines.index(self.DIVIDER)
            return lines[0:split], lines[split + 1:]
        except ValueError:
            return lines, []

Second, we need to figure out where each allocation lies and create a label accordingly:

    def _add_allocations(self, base_of_data, labels, to_allocate):
        for alloc in to_allocate:
            fields = [a.strip() for a in alloc.split(":")]
            assert len(fields) == 2, f"Invalid allocation directive '{alloc}'"
            lbl, num_words_text = fields
            assert lbl not in labels, f"Duplicate label '{lbl}' in allocation"
            num_words = int(num_words_text)
            assert (base_of_data + num_words) < RAM_LEN, \
                f"Allocation '{lbl}' requires too much memory"
            labels[lbl] = base_of_data
            base_of_data += num_words

And that’s it: no other changes are needed to either compilation or execution. To test it, let’s fill an array with the numbers from 0 to 3:

# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
# - R2: array index.
# - R3: temporary.
ldc R0 0
ldc R1 3
ldc R2 @array
loop:
str R0 R2
ldc R3 1
add R0 R3
add R2 R3
cpy R3 R1
sub R3 R0
bne R3 @loop
hlt
.data
array: 10

R000000 = 000003
R000001 = 000003
R000002 = 00000e
R000003 = 000000
000000:   000002  030102  0b0202  020005
000004:   010302  030006  030206  010304
000008:   000307  030309  000001  000000
00000c:   000001  000002  000000  000000

Summary

Figure 25.6 summarizes the new ideas in this chapter. Real processors and the VMs for languages like Python are more complex, but experience shows that keeping things simple makes it much easier to make them fast and reliable.

Figure 25.6: Concept map for virtual machine and assembler.

Exercises

Swapping Values

Write an assembly language program that swaps the values in R1 and R2 without affecting the values in other registers.

Reversing an Array

Write an assembly language program that starts with:

the base address of an array in one word
the length of the array N in the next word
N values immediately thereafter

and that reverses the array in place.

Increment and Decrement

Add instructions inc and dec that add one to the value of a register and subtract one from the value of a register respectively.
Rewrite the examples to use these instructions. How much shorter do they make the programs? Do they make it easier to read?

Using Long Addresses

Modify the virtual machine so that the ldr and str instructions contain 16-bit addresses rather than 8-bit addresses and increase the virtual machine’s memory to 64K words to match.
How does this complicate instruction interpretation?

Operating on Strings

The C programming language stored character strings as non-zero bytes terminated by a byte containing zero.

Write a program that starts with the base address of a string in R1 and finishes with the length of the string (not including the terminator) in the same register.
Write a program that starts with the base address of a string in R1 and the base address of some other block of memory in R2 and copies the string to that new location (including the terminator).
What happens in each case if the terminator is missing?

Call and Return

Add another register to the virtual machine called SP (for “stack pointer”) that is automatically initialized to the last address in memory.
Add an instruction psh (short for “push”) that copies a value from a register to the address stored in SP and then subtracts one from SP.
Add an instruction pop (short for “pop”) that adds one to SP and then copies a value from that address into a register.
Using these instructions, write a subroutine that evaluates 2x+1 for every value in an array.

Disassembling Instructions

A disassembler turns machine instructions into assembly code. Write a disassembler for the instruction set used by our virtual machine. (Since the labels for addresses are not stored in machine instructions, disassemblers typically generate labels like @L001 and @L002.)

Linking Multiple Files

Modify the assembler to handle .include filename directives.
What does your modified assembler do about duplicate label names? How does it prevent infinite includes (i.e., A.as includes B.as which includes A.as again)?

Providing System Calls

Modify the virtual machine so that developers can add “system calls” to it.

On startup, the virtual machine loads an array of functions defined in a file called syscalls.py.
The sys instruction takes a one-byte constant argument. It looks up the corresponding function and calls it with the values of R0-R3 as arguments and places the result in R0.

Unit Testing

Write unit tests for the assembler.
Once they are working, write unit tests for the virtual machine.