Introduction

A quick example

Are some programmers really ten times more productive than average? To find out, [Prechelt2000] had N programmers solve the same problem in the language of their choice, then looked at how long it took them, how good their solutions were, and how fast those solutions ran. The data is [available online][prechelt-data], and the first few rows look like this:

person,lang,z1000t,z0t,z1000mem,stmtL,z1000rel,m1000rel,whours,caps
s015,C++,0.050,0.050,24616,374,99.24,100.0,11.20,??
s017,Java,0.633,0.433,41952,509,100.00,10.2,48.90,??
s018,C,0.017,0.017,22432,380,98.10,96.8,16.10,??
s020,C++,1.983,0.550,6384,166,98.48,98.4,3.00,??
s021,C++,4.867,0.017,5312,298,100.00,98.4,19.10,??
s023,Java,2.633,0.650,89664,384,7.60,98.4,7.10,??
s025,C++,0.083,0.083,28568,150,99.24,98.4,3.50,??
...

The columns hold the following information:

| Column   | Meaning                                    |
| -------- | ------------------------------------------ |
| person   | subject identifier                         |
| lang     | programming language used                  |
| z1000t   | running time for z1000 input file          |
| z0t      | running time for z0 input file             |
| z1000mem | memory consumption at end of z1000 run     |
| stmtL    | program length in statement lines of code  |
| z1000rel | output reliability for z1000 input file    |
| m1000rel | output reliability for m1000 input file    |
| whours   | total subject work time                    |
| caps     | subject self-evaluation                    |

The z1000rel and m1000rel columns tell us that all of these implementations are correct 98% of the time or better, which is considered acceptable. The rest of the data is much easier to understand as a box-and-whisker plot of the working time in hours (the whours column from the table). Each dot is a single data point (jittered up or down a bit to be easier to see). The left and right boundaries of the box show the 25th and 75th percentiles respectively, i.e., 25% of the points lie below the box and 25% lie above it, and the mark in the middle shows the median.

{% include figure id="boxplot" img="figures/boxplot.svg" cap="Development Time" title="Box-and-whisker plot showing that most developers spent between zero and 20 hours but a handful took as long as 63 hours." %}
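The quartiles and median that the box plot summarizes are easy to check by hand. The sketch below uses only the Python standard library and the seven sample rows shown earlier; the values for the full dataset will of course differ.

```python
import statistics

# Working times in hours (the whours column) from the seven sample rows above.
whours = [11.20, 48.90, 16.10, 3.00, 19.10, 7.10, 3.50]

# "Inclusive" quartiles interpolate linearly between sorted data points,
# which matches the percentile convention described in the text.
q25, median, q75 = statistics.quantiles(whours, n=4, method="inclusive")

print(f"25th percentile: {q25:.2f} hours")
print(f"median:          {median:.2f} hours")
print(f"75th percentile: {q75:.2f} hours")
```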

So what does this data tell us about productivity? As [Prechelt2019] explains, that depends on exactly what we mean. The shortest and longest development times were 0.6 and 63 hours respectively, giving a ratio of 105X. However, the subjects used seven different languages; if we only look at those who used Java (about 30% of the whole) the shortest and longest times are 3.8 and 63 hours, giving a ratio of "only" 17X.

But comparing the best and the worst of anything is guaranteed to give us an exaggerated impression of the difference. If we compare the 75th percentile (which is the middle of the top half of the data) to the 25th percentile (which is the middle of the bottom half) we get a ratio of 18.5/7.25 or 2.55; if we compare the 90th percentile to the 50th we get 3.7, and other comparisons give us other values.
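The ratios quoted above are plain arithmetic on the reported times, so they can be recomputed directly. (The 90th-to-50th percentile comparison needs the full dataset and is omitted here.)

```python
# Ratios quoted in the text, recomputed from the reported times.
best_vs_worst = 63 / 0.6      # all languages: 105X
java_only = 63 / 3.8          # Java subjects only: "only" 17X
q75_vs_q25 = 18.5 / 7.25      # 75th vs. 25th percentile: about 2.55

print(round(best_vs_worst), round(java_only), round(q75_vs_q25, 2))
```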

Who are these lessons for?

Every lesson should aim to meet the needs of specific people [Wilson2019]. As these learner personas suggest, these lessons assume readers can write short Python programs and remember some college-level mathematics:

If you know what a Python dictionary is and can explain the difference between an exponent and a logarithm, you are probably ready to start. We cover data tidying and visualization, descriptive statistics, modeling, statistical tests, and reproducible research practices, and then use those tools to explore key findings from empirical software engineering research.
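As a quick self-check, readers should be comfortable with a short made-up snippet like this one (the variable names are illustrative, not part of the lessons; the times are the fastest z1000t values from the sample rows above):

```python
import math

# A dictionary maps keys to values -- here, languages to running times in seconds.
fastest = {"C": 0.017, "C++": 0.050, "Java": 0.633}
assert fastest["C"] < fastest["Java"]

# Exponents and logarithms are inverses: log(e**x) recovers x.
assert math.isclose(math.log(math.e ** 3), 3.0)
```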

What does this material cover?

We chose our topics to teach people how to analyze messy real-world data correctly, what we already know about software and software development, and why we believe it's true. Our choice of examples was guided by [Begel2014], which asked several hundred professional software engineers what questions they most wanted researchers to answer. The most highly-ranked questions include:

{: .continue} Our lessons don't answer all of these—in fact, most of them don't have answers—but we hope we can help people get started.

Moving Targets

[Begel2014] also asked what topics would be unwise for researchers to examine; all of the top responses were variations on, "Anything that measures individual employees' productivity." This belief reflects [Goodhart's Law][goodharts-law]: as soon as a measure is used to evaluate people, they adjust their behavior so that it ceases to be a useful measure.

{% include contents %}

All of our examples use Python or the Unix shell. We display Python source code like this:

{% include file file="loop.py" %}

{: .noindent} and Unix shell commands like this:

{% include file file="loop.sh" %}

{: .noindent} Data and programs' output are shown in italics:

Package,Releases
0,1
0-0,0
0-0-1,1
00print-lol,2
00smalinux,0
01changer,0

Acknowledgments

We are grateful to: