A Virtual Machine
- Every computer has a processor with a particular instruction set, some registers, and memory.
- Instructions are just numbers but may be represented as assembly code.
- Instructions may refer to registers, memory, both, or neither.
- A processor usually executes instructions in order but may jump to another location based on whether a conditional is true or false.
Terms defined: Application Binary Interface, assembler, assembly code, bytecode, conditional jump, disassembler, instruction pointer, instruction set, label (of address in memory), op code, register (in hardware), virtual machine, word (of memory)
The interpreter in Chapter 7 relied on Python to do most of the actual work. The standard version of Python is implemented in C and relies on C’s operators to add numbers, index arrays, and so on, but C is compiled to instructions for a particular processor. Each operation in the little language of Chapter 7 is therefore expanded by several layers of software to become something that hardware can actually run. To show how that lower layer works, this chapter builds a simulator of a small computer. If you want to dive deeper into programming at this level, have a look at the game Human Resource Machine.
Architecture
Our virtual machine (VM) simulates a computer with three parts (Figure 25.1):
-
The instruction pointer (IP) holds the memory address of the next instruction to execute. It is automatically initialized to point at address 0, so that is where every program must start. This requirement is part of our VM’s Application Binary Interface (ABI).
-
Four registers named R0 to R3 that instructions can access directly. There are no memory-to-memory operations in our VM: everything happens in or through registers.
-
256 words of memory, each of which can store a single value. Both the program and its data live in this single block of memory; we chose the size 256 so that the address of each word will fit in a single byte.
Our processor’s instruction set defines what it can do. Instructions are just numbers, but we will write them in a simple text format called assembly code that gives those number human-readable names.
The instructions for our VM are 3 bytes long. The op code fits in one byte, and each instruction may include zero, one, or two single-byte operands. (Instructions are sometimes called bytecode, since they’re packed into bytes, but so is everything else in a computer.)
Each operand is a register identifier,
a constant,
or an address,
which is just a constant that identifies a location in memory.
Since constants have to fit in one byte,
this means that the largest number we can represent directly is 256.
Table 25.1 uses the letters r
and v
to indicate instruction format,
where r
indicates a register identifier
and v
indicates a constant value.
Name | Code | Format | Action | Example | Equivalent |
---|---|---|---|---|---|
hlt |
1 | -- |
Halt program | hlt |
sys.exit(0) |
ldc |
2 | rv |
Load constant | ldc R0 99 |
R0 = 99 |
ldr |
3 | rr |
Load register | ldr R0 R1 |
R0 = memory[R1] |
cpy |
4 | rr |
Copy register | cpy R0 R1 |
R0 = R1 |
str |
5 | rr |
Store register | str R0 R1 |
memory[R1] = R0 |
add |
6 | rr |
Add | add R0 R1 |
R0 = R0 + R1 |
sub |
7 | rr |
Subtract | sub R0 R1 |
R0 = R0 - R1 |
beq |
8 | rv |
Branch if equal | beq R0 99 |
if (R0==0) IP = 99 |
bne |
9 | rv |
Branch if not equal | bne R0 99 |
if (R0!=0) IP = 99 |
prr |
10 | r- |
Print register | prr R0 |
print(R0) |
prm |
11 | r- |
Print memory | prm R0 |
print(memory[R0]) |
To start building our virtual machine, we put the VM’s details in a file that can be loaded by other modules:
NUM_REG = 4 # number of registers
RAM_LEN = 256 # number of words in RAM
OPS = {
"hlt": {"code": 0x1, "fmt": "--"}, # Halt program
"ldc": {"code": 0x2, "fmt": "rv"}, # Load value
"ldr": {"code": 0x3, "fmt": "rr"}, # Load register
"cpy": {"code": 0x4, "fmt": "rr"}, # Copy register
"str": {"code": 0x5, "fmt": "rr"}, # Store register
"add": {"code": 0x6, "fmt": "rr"}, # Add
"sub": {"code": 0x7, "fmt": "rr"}, # Subtract
"beq": {"code": 0x8, "fmt": "rv"}, # Branch if equal
"bne": {"code": 0x9, "fmt": "rv"}, # Branch if not equal
"prr": {"code": 0xA, "fmt": "r-"}, # Print register
"prm": {"code": 0xB, "fmt": "r-"}, # Print memory
}
OP_MASK = 0xFF # select a single byte
OP_SHIFT = 8 # shift up by one byte
OP_WIDTH = 6 # op width in characters when printing
There isn’t a name for this design pattern, but putting all the constants that define a system in one file instead of scattering them across multiple files makes them easier to find as well as ensuring consistency.
Execution
We start by defining a class with an instruction pointer, some registers, and some memory along with a prompt for output. A program is just an array of numbers representing instructions. To load a program into our VM, we copy those numbers into memory and reset the instruction pointer and registers:
class VirtualMachine:
def __init__(self):
self.initialize([])
self.prompt = ">>"
def initialize(self, program):
assert len(program) <= RAM_LEN, "Program too long"
self.ram = [
program[i] if (i < len(program)) else 0
for i in range(RAM_LEN)
]
self.ip = 0
self.reg = [0] * NUM_REG
Notice that the VM’s constructor calls initialize
with an empty array
(i.e., a program with no instructions)
to do initial setup.
If an object has a method to reset or reinitialize itself,
having its constructor use that method
is a way to avoid duplicating code.
To execute the next instruction, the VM gets the value in memory that the instruction pointer currently refers to and moves the instruction pointer on by one address. It then uses bitwise operations (Chapter 17) to extract the op code and operands from the instruction (Figure 25.2).
def fetch(self):
instruction = self.ram[self.ip]
self.ip += 1
op = instruction & OP_MASK
instruction >>= OP_SHIFT
arg0 = instruction & OP_MASK
instruction >>= OP_SHIFT
arg1 = instruction & OP_MASK
return [op, arg0, arg1]
We always unpack two operands regardless of whether the instructions have them or not, since this is what most hardware implementations would do.
Processor Design
Some processors do have variable-length instructions, but they make the hardware more complicated and therefore slower. To decide whether these costs are worth paying, engineers rely on simulation and profiling (Chapter 15). Backward compatibility is also an issue: if earlier processors supported variable-length instructions, later ones must somehow do so as well in order to run old programs.
The next step is to add a run
method to our VM
that fetches instructions and executes them until told to stop:
def run(self):
running = True
while running:
op, arg0, arg1 = self.fetch()
if op == OPS["hlt"]["code"]:
running = False
elif op == OPS["ldc"]["code"]:
self.reg[arg0] = arg1
elif op == OPS["ldr"]["code"]:
self.reg[arg0] = self.ram[self.reg[arg1]]
elif op == OPS["cpy"]["code"]:
self.reg[arg0] = self.reg[arg1]
else:
assert False, f"Unknown op {op:06x}"
Let’s look more closely at three of these instructions.
The first, str
, stores the value of one register in the address held by another register:
elif op == OPS["str"]["code"]:
self.ram[self.reg[arg1]] = self.reg[arg0]
Adding the value in one register to the value in another register is simpler:
elif op == OPS["add"]["code"]:
self.reg[arg0] += self.reg[arg1]
as is jumping to a fixed address if the value in a register is zero.
This conditional jump instruction is how we implement if
:
elif op == OPS["beq"]["code"]:
if self.reg[arg0] == 0:
self.ip = arg1
Assembly Code
We could write out numerical op codes by hand just as early programmers did. However, it is much easier to use an assembler, which is just a small compiler for a language that very closely represents actual machine instructions.
Each command in our assembly languages matches an instruction in the VM. Here’s an assembly language program to print the value stored in R1 and then halt:
# Print initial contents of R1.
prr R1
hlt
Its numeric representation (in hexadecimal) is:
00010a
000001
One thing the assembly language has that the instruction set doesn’t
is labels on addresses in memory.
The label loop
doesn’t take up any space;
instead,
it tells the assembler to give the address of the next instruction a name
so that we can refer to @loop
in jump instructions.
For example,
this program prints the numbers from 0 to 2
(Figure 25.3):
|
|
Let’s trace this program’s execution (Figure 25.4):
- R0 holds the current loop index.
- R1 holds the loop’s upper bound (in this case 3).
- The loop prints the value of R0 (one instruction).
- The program adds 1 to R0. This takes two instructions because we can only add register-to-register.
- It checks to see if we should loop again, which takes three instructions.
- If the program doesn’t jump back, it halts.
The implementation of the assembler mirrors the simplicity of assembly language. The main method gets interesting lines, finds the addresses of labels, and turns each remaining line into an instruction:
class Assembler:
def assemble(self, lines):
lines = self._get_lines(lines)
labels = self._find_labels(lines)
instructions = [
ln for ln in lines if not self._is_label(ln)
]
compiled = [
self._compile(instr, labels) for instr in instructions
]
program = self._to_text(compiled)
return program
To find labels, we go through the lines one by one and either save the label or increment the current address (because labels don’t take up space):
def _find_labels(self, lines):
result = {}
loc = 0
for ln in lines:
if self._is_label(ln):
label = ln[:-1].strip()
assert label not in result, f"Duplicated {label}"
result[label] = loc
else:
loc += 1
return result
def _is_label(self, line):
return line.endswith(":")
To compile a single instruction we break the line into pieces, look up the format for the operands, and pack the values:
def _compile(self, instruction, labels):
tokens = instruction.split()
op, args = tokens[0], tokens[1:]
fmt, code = OPS[op]["fmt"], OPS[op]["code"]
if fmt == "--":
return self._combine(code)
elif fmt == "r-":
return self._combine(self._reg(args[0]), code)
elif fmt == "rr":
return self._combine(
self._reg(args[1]), self._reg(args[0]), code
)
elif fmt == "rv":
return self._combine(
self._val(args[1], labels),
self._reg(args[0]), code
)
To convert a value,
we either look up the label’s address (if the value starts with @
)
or convert the value to a number:
def _val(self, token, labels):
if token[0] != "@":
return int(token)
lbl = token[1:]
assert lbl in labels, f"Unknown label '{token}'"
return labels[lbl]
Combining op codes and operands into a single value is the reverse of the unpacking done by the virtual machine:
def _combine(self, *args):
assert len(args) > 0, "Cannot combine no arguments"
result = 0
for a in args:
result <<= OP_SHIFT
result |= a
return result
As a test, this program counts up to 3:
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
ldc R0 0
ldc R1 3
loop:
prr R0
ldc R2 1
add R0 R2
cpy R2 R1
sub R2 R0
bne R2 @loop
hlt
>> 0
>> 1
>> 2
R000000 = 000003
R000001 = 000003
R000002 = 000000
R000003 = 000000
000000: 000002 030102 00000a 010202
000004: 020006 010204 000207 020209
000008: 000001 000000 000000 000000
Arrays
It’s tedious to write programs when each value needs a unique name. We can do a lot more once we have arrays, so let’s add those to our assembler. We don’t have to make any changes to the virtual machine, which doesn’t care if we think of a bunch of numbers as individuals or elements of an array, but we do need a way to create arrays and refer to them.
We will allocate storage for arrays at the end of the program
by using .data
on a line of its own to mark the start of the data section
and then label: number
to give a region a name and allocate some storage space
(Figure 25.5).
This enhancement only requires a few changes to the assembler. First, we need to split the lines into instructions and data allocations:
DIVIDER = ".data"
def assemble(self, lines):
lines = self._get_lines(lines)
to_compile, to_allocate = self._split(lines)
labels = self._find_labels(lines)
instructions = [ln for ln in to_compile if not self._is_label(ln)]
base_of_data = len(instructions)
self._add_allocations(base_of_data, labels, to_allocate)
compiled = [self._compile(instr, labels) for instr in instructions]
program = self._to_text(compiled)
return program
def _split(self, lines):
try:
split = lines.index(self.DIVIDER)
return lines[0:split], lines[split + 1:]
except ValueError:
return lines, []
Second, we need to figure out where each allocation lies and create a label accordingly:
def _add_allocations(self, base_of_data, labels, to_allocate):
for alloc in to_allocate:
fields = [a.strip() for a in alloc.split(":")]
assert len(fields) == 2, f"Invalid allocation directive '{alloc}'"
lbl, num_words_text = fields
assert lbl not in labels, f"Duplicate label '{lbl}' in allocation"
num_words = int(num_words_text)
assert (base_of_data + num_words) < RAM_LEN, \
f"Allocation '{lbl}' requires too much memory"
labels[lbl] = base_of_data
base_of_data += num_words
And that’s it: no other changes are needed to either compilation or execution. To test it, let’s fill an array with the numbers from 0 to 3:
# Count up to 3.
# - R0: loop index.
# - R1: loop limit.
# - R2: array index.
# - R3: temporary.
ldc R0 0
ldc R1 3
ldc R2 @array
loop:
str R0 R2
ldc R3 1
add R0 R3
add R2 R3
cpy R3 R1
sub R3 R0
bne R3 @loop
hlt
.data
array: 10
R000000 = 000003
R000001 = 000003
R000002 = 00000e
R000003 = 000000
000000: 000002 030102 0b0202 020005
000004: 010302 030006 030206 010304
000008: 000307 030309 000001 000000
00000c: 000001 000002 000000 000000
Summary
Figure 25.6 summarizes the new ideas in this chapter. Real processors and the VMs for languages like Python are more complex, but experience shows that keeping things simple makes it much easier to make them fast and reliable.
Exercises
Swapping Values
Write an assembly language program that swaps the values in R1 and R2 without affecting the values in other registers.
Reversing an Array
Write an assembly language program that starts with:
- the base address of an array in one word
- the length of the array N in the next word
- N values immediately thereafter
and that reverses the array in place.
Increment and Decrement
-
Add instructions
inc
anddec
that add one to the value of a register and subtract one from the value of a register respectively. -
Rewrite the examples to use these instructions. How much shorter do they make the programs? Do they make it easier to read?
Using Long Addresses
-
Modify the virtual machine so that the
ldr
andstr
instructions contain 16-bit addresses rather than 8-bit addresses and increase the virtual machine’s memory to 64K words to match. -
How does this complicate instruction interpretation?
Operating on Strings
The C programming language stored character strings as non-zero bytes terminated by a byte containing zero.
-
Write a program that starts with the base address of a string in R1 and finishes with the length of the string (not including the terminator) in the same register.
-
Write a program that starts with the base address of a string in R1 and the base address of some other block of memory in R2 and copies the string to that new location (including the terminator).
-
What happens in each case if the terminator is missing?
Call and Return
-
Add another register to the virtual machine called SP (for “stack pointer”) that is automatically initialized to the last address in memory.
-
Add an instruction
psh
(short for “push”) that copies a value from a register to the address stored in SP and then subtracts one from SP. -
Add an instruction
pop
(short for “pop”) that adds one to SP and then copies a value from that address into a register. -
Using these instructions, write a subroutine that evaluates
2x+1
for every value in an array.
Disassembling Instructions
A disassembler turns machine instructions into assembly code.
Write a disassembler for the instruction set used by our virtual machine.
(Since the labels for addresses are not stored in machine instructions,
disassemblers typically generate labels like @L001
and @L002
.)
Linking Multiple Files
-
Modify the assembler to handle
.include filename
directives. -
What does your modified assembler do about duplicate label names? How does it prevent infinite includes (i.e.,
A.as
includesB.as
which includesA.as
again)?
Providing System Calls
Modify the virtual machine so that developers can add “system calls” to it.
-
On startup, the virtual machine loads an array of functions defined in a file called
syscalls.py
. -
The
sys
instruction takes a one-byte constant argument. It looks up the corresponding function and calls it with the values of R0-R3 as arguments and places the result in R0.
Unit Testing
-
Write unit tests for the assembler.
-
Once they are working, write unit tests for the virtual machine.