Simple Metrics
Terms defined: Little's Law
Measuring Delay
Params
```python
from dataclasses import dataclass

from dataclasses_json import dataclass_json

@dataclass_json
@dataclass
class Params:
    t_job_arrival: float = 2.0
    t_job_mean: float = 0.5
    t_job_std: float = 0.6
    t_sim: float = 10
```
`Job` class keeps track of when it was created, started, and completed
- `t_start` and `t_complete` may be null
- Newly-created jobs add themselves to `Job._all` automatically
- Jobs know how to convert themselves to JSON for persistence
- Use `util.rnd(…)` to round values to `PRECISION` decimal places
- Use `itertools.count` to generate unique IDs
```python
from itertools import count

import util  # helper module providing rnd(…) for rounding

class Job:
    SAVE_KEYS = ["t_create", "t_start", "t_complete"]
    _next_id = count()
    _all = []

    @classmethod
    def reset(cls):
        cls._next_id = count()
        cls._all = []

    def __init__(self, sim):
        Job._all.append(self)
        self.id = next(Job._next_id)
        self.duration = sim.rand_job_duration()
        self.t_create = sim.env.now
        self.t_start = None
        self.t_complete = None

    def json(self):
        return {key: util.rnd(self, key) for key in self.SAVE_KEYS}
```
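The counter-and-registry machinery can be exercised on its own. A minimal standalone sketch, with a plain `round()` standing in for `util.rnd` and `PRECISION` assumed to be 2 (both assumptions, not the real `util` module):

```python
from itertools import count

PRECISION = 2  # assumed value; the real constant lives in util

class MiniJob:
    """Standalone sketch of the Job bookkeeping pattern."""
    SAVE_KEYS = ["t_create", "t_start", "t_complete"]
    _next_id = count()
    _all = []

    @classmethod
    def reset(cls):
        cls._next_id = count()
        cls._all = []

    def __init__(self, t_create):
        MiniJob._all.append(self)
        self.id = next(MiniJob._next_id)
        self.t_create = t_create
        self.t_start = None
        self.t_complete = None

    def json(self):
        # simplified stand-in for util.rnd: round floats, pass None through
        result = {}
        for key in self.SAVE_KEYS:
            val = getattr(self, key)
            result[key] = round(val, PRECISION) if isinstance(val, float) else val
        return result
```

Creating two jobs gives them IDs 0 and 1, both land in `MiniJob._all`, and `reset()` starts the numbering over — exactly the behavior the real class relies on between scenarios.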
`Simulation`
- Remember to reset the record of jobs done at the start of the simulation
- Otherwise, data from multiple scenarios will pile up
- Yes, this is a design smell and we should fix it
```python
from simpy import Environment, Store

class Simulation:
    def __init__(self):
        self.env = Environment()
        self.params = Params()
        self.queue = Store(self.env)

    # rand_job_arrival and rand_job_duration (not shown) draw times from the
    # distributions configured in Params

    def simulate(self):
        Job.reset()
        self.queue = Store(self.env)
        self.env.process(manager(self))
        self.env.process(coder(self))
        self.env.run(until=self.params.t_sim)
```
`manager` and `coder` are straightforward
```python
def manager(sim):
    while True:
        job = Job(sim=sim)
        yield sim.queue.put(job)
        yield sim.env.timeout(sim.rand_job_arrival())

def coder(sim):
    while True:
        job = yield sim.queue.get()
        job.t_start = sim.env.now
        yield sim.env.timeout(job.duration)
        job.t_complete = sim.env.now
```
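We can predict the timestamps these two processes produce without running SimPy at all: with a single coder, each job starts at the later of its creation time and the previous job's completion time. A standalone sketch of that recurrence (the handling of jobs still in flight at `t_sim` is an approximation of SimPy's cutoff behavior):

```python
def single_server_schedule(arrivals, durations, t_sim):
    """Replay manager+coder by hand: FIFO queue, one server."""
    jobs = []
    t_free = 0.0  # when the single coder next becomes available
    for t_create, duration in zip(arrivals, durations):
        t_start = max(t_create, t_free)
        t_complete = t_start + duration
        t_free = t_complete
        if t_start >= t_sim:        # never started before the clock ran out
            t_start = t_complete = None
        elif t_complete > t_sim:    # started but still in flight at the end
            t_complete = None
        jobs.append({"t_create": t_create,
                     "t_start": t_start,
                     "t_complete": t_complete})
    return jobs
```

For example, arrivals at 0.0, 1.0, and 1.5 with durations 2.0, 0.5, and 3.0 give start times 0.0, 2.0, and 2.5: the second and third jobs queue up behind the first.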
- Output with default parameters
```text
## jobs
shape: (8, 9)
┌──────────┬─────────┬────────────┬─────┬───┬───────────────┬────────────┬───────────┬───────┐
│ t_create ┆ t_start ┆ t_complete ┆ id ┆ … ┆ t_job_arrival ┆ t_job_mean ┆ t_job_std ┆ t_sim │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ i32 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ i32 │
╞══════════╪═════════╪════════════╪═════╪═══╪═══════════════╪════════════╪═══════════╪═══════╡
│ 0.0 ┆ 0.0 ┆ 0.68 ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
│ 3.12 ┆ 3.12 ┆ 3.99 ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
│ 4.69 ┆ 4.69 ┆ 8.33 ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
│ 5.25 ┆ 8.33 ┆ 9.15 ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
│ 6.87 ┆ 9.15 ┆ null ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
│ 8.84 ┆ null ┆ null ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
│ 9.0 ┆ null ┆ null ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
│ 9.37 ┆ null ┆ null ┆ 0 ┆ … ┆ 2.0 ┆ 0.5 ┆ 0.6 ┆ 10 │
└──────────┴─────────┴────────────┴─────┴───┴───────────────┴────────────┴───────────┴───────┘
```
- Plot delays for three different simulation durations
- This is what we mean by parameter sweeping
```python
import polars as pl
import plotly.express as px

import util

if __name__ == "__main__":
    args, results = util.run(Params, Simulation)
    jobs = util.as_frames(results)["jobs"]
    jobs = (
        jobs
        .filter(pl.col("t_start").is_not_null())
        .sort("t_create")
        .with_columns((pl.col("t_start") - pl.col("t_create")).alias("delay"))
    )
    fig = px.line(jobs, x="t_start", y="delay", facet_col="t_sim")
```
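The delay column is simple enough to check in plain Python. A stdlib sketch over dictionaries shaped like the job records above (using values taken from the printed table):

```python
def job_delays(jobs):
    """Delay = time from creation to start, for jobs that actually started."""
    started = [j for j in jobs if j["t_start"] is not None]
    started.sort(key=lambda j: j["t_create"])
    return [round(j["t_start"] - j["t_create"], 2) for j in started]
```

The job created at 5.25 that started at 8.33 waited 3.08 ticks; the one created at 6.87 that started at 9.15 waited 2.28 — the delays that show up in the plot.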
Four Metrics
- Reminder that we are interested in:
- Backlog: how much work is waiting to start vs. time?
- Delay: how long from job creation to job start?
- Throughput: how many jobs are completed per unit time?
- Utilization: how busy are the people on the team?
- Use classes for processes instead of naked generators
- Gives us a place to store extra data and access it from outside
`Recorder` base class creates unique per-class IDs and saves instances
- A generalization of the machinery we built for the `Job` class above
- Reset IDs and object lists in between parameter sweeps
- Expect derived classes to define `SAVE_KEYS` to identify what to save as JSON
```python
from collections import defaultdict
from itertools import count

import util

class Recorder:
    _next_id = defaultdict(count)
    _all = defaultdict(list)

    @staticmethod
    def reset():
        Recorder._next_id = defaultdict(count)
        Recorder._all = defaultdict(list)

    def __init__(self, sim):
        cls = self.__class__
        self.id = next(self._next_id[cls])
        self._all[cls].append(self)
        self.sim = sim

    def json(self):
        return {key: util.rnd(self, key) for key in self.SAVE_KEYS}
```
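The per-class counters are the only subtle part: one `defaultdict(count)` keyed by the concrete class gives every subclass its own independent ID sequence. A standalone sketch of just that trick (class names here are illustrative):

```python
from collections import defaultdict
from itertools import count

class MiniRecorder:
    _next_id = defaultdict(count)   # one counter per concrete class
    _all = defaultdict(list)        # one instance list per concrete class

    @staticmethod
    def reset():
        MiniRecorder._next_id = defaultdict(count)
        MiniRecorder._all = defaultdict(list)

    def __init__(self):
        cls = self.__class__
        self.id = next(self._next_id[cls])   # falls back to the base-class dict
        self._all[cls].append(self)

class Alpha(MiniRecorder):
    pass

class Beta(MiniRecorder):
    pass
```

Two `Alpha` instances get IDs 0 and 1 while the first `Beta` also gets 0, and `reset()` restarts every sequence at once — which is why one call between parameter sweeps is enough.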
`Manager` doesn't really need to be a class, but consistency makes code easier to understand
```python
class Manager(Recorder):
    def run(self):
        while True:
            job = Job(sim=self.sim)
            yield self.sim.queue.put(job)
            yield self.sim.env.timeout(self.sim.rand_job_arrival())
```
`Coder` keeps track of how much time it has spent working
```python
class Coder(Recorder):
    SAVE_KEYS = ["t_work"]

    def __init__(self, sim):
        super().__init__(sim)
        self.t_work = 0

    def run(self):
        while True:
            job = yield self.sim.queue.get()
            job.t_start = self.sim.env.now
            yield self.sim.env.timeout(job.duration)
            job.t_complete = self.sim.env.now
            self.t_work += job.t_complete - job.t_start
```
`Monitor` records the length of the queue every few ticks
- SimPy `Store` keeps items in a list-like object `queue.items`
```python
class Monitor(Recorder):
    def run(self):
        while True:
            self.sim.lengths.append(
                {"time": self.sim.env.now, "length": len(self.sim.queue.items)}
            )
            # assumes Params has gained a t_monitor field for the sampling interval
            yield self.sim.env.timeout(self.sim.params.t_monitor)
```
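What the monitor samples can be cross-checked after the fact: the backlog at time t is the number of jobs created by t minus the number started by t. A stdlib sketch, tested against the job records printed for the first run:

```python
def backlog_at(t, jobs):
    """Queue length at time t: jobs created but not yet started."""
    created = sum(1 for j in jobs if j["t_create"] <= t)
    started = sum(1 for j in jobs if j["t_start"] is not None
                  and j["t_start"] <= t)
    return created - started
```

At t = 7.0, for instance, the jobs created at 5.25 and 6.87 are both waiting (they don't start until 8.33 and 9.15), so the backlog is 2.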
`Simulation` creates instances and calls their `.run()` methods
- After resetting all the recorders
```python
class Simulation:
    # …as before…

    def run(self):
        Recorder.reset()
        self.lengths = []  # queue-length samples collected by Monitor
        self.queue = Store(self.env)
        self.env.process(Manager(self).run())
        self.env.process(Coder(self).run())
        self.env.process(Monitor(self).run())
        self.env.run(until=self.params.t_sim)
```
- Report results
```python
class Simulation:
    # …as before…

    def result(self):
        return {
            # assumes Job has also been rewritten to derive from Recorder
            "jobs": [job.json() for job in Recorder._all[Job]],
            "coders": [coder.json() for coder in Recorder._all[Coder]],
            "lengths": self.lengths,
        }
```
- Use Polars and Plotly Express to analyze and plot
- Run the simulation twice each for 200 and 1000 ticks
| id | t_sim | num_jobs | throughput |
|---|---|---|---|
| 0 | 200 | 96 | 0.48 |
| 1 | 200 | 92 | 0.46 |
| 2 | 1000 | 489 | 0.49 |
| 3 | 1000 | 492 | 0.49 |
| id | t_sim | total_work | utilization |
|---|---|---|---|
| 0 | 200 | 172.94 | 0.86 |
| 1 | 200 | 175.23 | 0.88 |
| 2 | 1000 | 947.52 | 0.95 |
| 3 | 1000 | 931.92 | 0.93 |
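Both tables are simple ratios of the raw outputs: throughput is completed jobs divided by simulated time, and utilization is total working time divided by available coder-time. A sketch that reproduces the first row of each table (numbers taken from the tables above):

```python
def throughput(num_jobs, t_sim):
    """Completed jobs per unit time."""
    return round(num_jobs / t_sim, 2)

def utilization(total_work, num_coders, t_sim):
    """Fraction of available coder-time actually spent working."""
    return round(total_work / (num_coders * t_sim), 2)
```

With one coder, 96 jobs in 200 ticks gives a throughput of 0.48, and 172.94 ticks of work in 200 gives a utilization of 0.86 — matching row 0 of each table.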
- Backlog and delay track each other pretty closely, so we only need to measure one or the other.
- Throughput stabilizes right away. Utilization takes a little longer, but even then the change is pretty small as we increase the length of the simulation.
Imagine you're the manager in the fourth scenario. You might panic as backlog starts to rise, not realizing that it's just random variation.
Varying Arrival Rate
- Vary the job arrival rate from 0.5 to 4.0
- At first glance, the backlog either grows without bound or it doesn't
- Zoom in and vary job arrival rate from 1.0 to 2.0
- λ (lambda) is the arrival rate: the average number of jobs arriving per unit time
- μ (mu) is the service rate: the average number of jobs a single server can serve per unit time
- ρ (rho) is the utilization: the fraction of time the server is busy
- ρ = λ/μ for single-server systems if λ < μ
- If λ ≥ μ, the queue grows without limit
- Average waiting time in queue W = ρ/(μ(1-ρ)) for the classic M/M/1 model (Poisson arrivals, exponential service)
- Think of 1-ρ as spare capacity
- As the system approaches saturation, waiting times increase rapidly
- So if all the programmers are busy 100% of the time, the waiting time for new work explodes
- There must be slack in the system in order to keep waiting times down
Little's Law
- λ (lambda) is the arrival rate
- L is the average number of customers in the system
- W is average time a customer spends in the system
- Little's Law: L = λW
- Exercise: test this by modifying the simulation to allow multiple coders
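Before modifying the simulation, note that Little's Law is an accounting identity on any finite trace when λ, W, and L are all measured over the same window (assuming every job in the record completes inside it). A sketch to check simulation output against:

```python
def littles_law(records, horizon):
    """records: list of (t_arrive, t_depart) pairs; horizon: window length.
    Returns (lam, W, L) measured over the window."""
    n = len(records)
    sojourns = [t_out - t_in for t_in, t_out in records]
    lam = n / horizon              # arrival rate
    W = sum(sojourns) / n          # mean time a customer spends in the system
    L = sum(sojourns) / horizon    # time-average number in the system
    return lam, W, L
```

A multi-coder version of the simulation should reproduce L ≈ λW from its own job records, up to edge effects from jobs still in flight at the end.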