Hypothesis Testing
- Problem: how to estimate development effort or likely fault rates
- Intuitively seems like this should depend on some measurable properties of the software
- Lines of code (LoC) is easy to count but seems like a pretty coarse measure
- How can we count other things?
- How can we tell whether a hypothesis is true or not?
- Steps are:
  - Collect data
  - Formulate testable hypotheses
  - Calculate the likelihood of getting the answer we observe
  - Decide whether that answer is so unlikely that we probably didn't get it by chance
How can we get data on file sizes?
- Start by asking whether Python and JavaScript files are different, i.e., how our measurements depend on source language
- Parsing arguments and writing CSV is normal by now
- Use `os.walk` to recurse through directories
- After the first failure because of bad character encoding, add error handling
- Then capture `stderr` to a log file for later inspection
- Then realize that storing (filename, length, count):
  - Is too big for GitHub unless we compress it
  - Contains a lot of redundancy
- Instead of compressing the file, create two files:
  - One stores (filename, unique ID)
  - The other stores (unique ID, length, count)
  - So the largest field (the filename) only appears once
  - And gives us a natural way to record files we couldn't read because of character encoding issues (`NA` for the file ID)
- This technique is also useful when storing personally identifiable information (PII)
  - Shareable data is keyed
  - Personal data is not shared
  - Beware of de-anonymization
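As a sketch of that idea (the records, column names, and file names here are made up for illustration), splitting a table that contains personal identifiers into a private lookup file and a shareable keyed file could look like this:

```python
import csv

# Hypothetical example: made-up (name, measurement) records.
records = [('Alice', 172), ('Bob', 181), ('Chen', 169)]

with open('private-ids.csv', 'w', newline='') as private_file, \
     open('shareable-measurements.csv', 'w', newline='') as shared_file:
    private_writer = csv.writer(private_file)
    shared_writer = csv.writer(shared_file)
    private_writer.writerow(['Name', 'PersonID'])     # kept private
    shared_writer.writerow(['PersonID', 'HeightCm'])  # safe to share
    for (person_id, (name, height)) in enumerate(records):
        private_writer.writerow([name, person_id])
        shared_writer.writerow([person_id, height])
```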
- The main driver is straightforward
```python
def main():
    '''
    Main driver.
    '''
    args = parse_args()
    counts = lengths(args)
    report_filenames(args.output, counts)
    report_counts(args.output, counts)
```
{: title="bin/file-size.py"}
- Parse arguments
  - Root directory
  - Filename extension
  - Stem of output files (because there are two)
  - E.g., processing `data/python` will produce `data/python-filenames.csv` and `data/python-counts.csv`
```python
def parse_args():
    '''
    Handle command-line arguments.
    '''
    parser = argparse.ArgumentParser()
    parser.add_argument('--root', type=str, help='root directory')
    parser.add_argument('--ext', type=str, help='extension')
    parser.add_argument('--output', type=str, help='stem of output files')
    return parser.parse_args()
```
{: title="bin/file-size.py"}
- Find files and record lengths
  - A dictionary mapping filenames to `Counter` objects
  - Or to `None` if the file couldn't be read
```python
def lengths(args):
    '''
    Find files and count line lengths.
    '''
    counts = {}
    for (curr_dir, sub_dirs, files) in os.walk(args.root):
        for filename in [x for x in files if x.endswith(args.ext)]:
            path = os.path.join(curr_dir, filename)
            with open(path, 'r') as reader:
                try:
                    counts[path] = Counter()
                    for x in reader.readlines():
                        counts[path][len(x)] += 1
                except Exception as e:
                    print(f'Failed to read {path}: {e}',
                          file=sys.stderr)
                    counts[path] = None
    return counts
```
{: title="bin/file-size.py"}
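As a reminder of what each entry in `counts` holds, here is a tiny made-up example of a `Counter` tallying line lengths (the lengths include the trailing newline, which is why length-1 lines are empty):

```python
from collections import Counter

# Made-up lines; lengths include the trailing newline.
lines = ['x = 1\n', 'y = 22\n', 'z = 333\n', 'y = 22\n']
tally = Counter(len(line) for line in lines)
print(tally)                 # Counter({7: 2, 6: 1, 8: 1})
print(tally.most_common())   # [(7, 2), (6, 1), (8, 1)]
```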
- Write the filenames or `NA`
  - Uses `X if test else Y`, which is frequently confusing
```python
def report_filenames(output_stem, counts):
    '''
    Report filename-to-ID as CSV with 'NA' for errored files.
    '''
    with open(f'{output_stem}-filenames.csv', 'w') as writer:
        writer = csv.writer(writer, lineterminator='\n')
        writer.writerow(['Filename', 'FileID'])
        for (file_id, filename) in enumerate(sorted(counts.keys())):
            writer.writerow([filename, file_id if counts[filename] else 'NA'])
```
{: title="bin/file-size.py"}
- And the counts where available
```python
def report_counts(output_stem, counts):
    '''
    Report file ID-to-count as CSV for non-errored files.
    '''
    with open(f'{output_stem}-counts.csv', 'w') as writer:
        writer = csv.writer(writer, lineterminator='\n')
        writer.writerow(['FileID', 'Length', 'Count'])
        for (file_id, filename) in enumerate(sorted(counts.keys())):
            if counts[filename]:
                for (length, freq) in counts[filename].most_common():
                    writer.writerow([file_id, length, freq])
```
{: title="bin/file-size.py"}
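As a quick illustration of what the two files contain, here is a made-up run of the two functions above (assuming they and their `csv` import are available in the current session); `bad.py` stands in for a file that couldn't be read:

```python
from collections import Counter

# Made-up counts: one readable file and one that failed to read.
tiny = {
    'good.py': Counter({6: 1, 7: 2}),
    'bad.py': None,
}
report_filenames('demo', tiny)
report_counts('demo', tiny)
# demo-filenames.csv: bad.py gets 'NA', good.py gets ID 1
# demo-counts.csv: rows (1, 7, 2) and (1, 6, 1)
```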
```sh
python bin/file-size.py --root /anaconda3 --ext .py --output data/python
```
- Get some errors
  - Store in `log/file-size-python-errors.txt` for later reference
```text
Failed to read /anaconda3/pkgs/xlwt-1.3.0-py37_0/lib/python3.7/site-packages/xlwt/BIFFRecords.py: 'utf-8' codec can't decode byte 0x93 in position 68384: invalid start byte
Failed to read /anaconda3/pkgs/xlwt-1.3.0-py37_0/lib/python3.7/site-packages/xlwt/UnicodeUtils.py: 'utf-8' codec can't decode byte 0xb7 in position 1950: invalid start byte
Failed to read /anaconda3/pkgs/pylint-2.3.1-py37_0/lib/python3.7/site-packages/pylint/test/functional/implicit_str_concat_in_sequence_latin1.py: 'utf-8' codec can't decode byte 0xe9 in position 96: invalid continuation byte
Failed to read /anaconda3/pkgs/joblib-0.13.2-py37_0/lib/python3.7/site-packages/joblib/test/test_func_inspect_special_encoding.py: 'utf-8' codec can't decode byte 0xa4 in position 64: invalid start byte
...
```
{: title="log/file-size-python-errors.txt"}
- Also run the script over a JavaScript project:

```sh
python bin/file-size.py --root ~/blocks/node_modules --ext .js --output data/javascript
```

- `blocks` is a JavaScript project we just happen to have lying around
- `node_modules` is where the project's JavaScript packages are stored
- Create scatter plots to show the frequency of line lengths
- Add pattern rules to the `Snakefile`
  - Assign the languages and kinds of figures to variables
  - Fill in all combinations of those variables with `expand`
  - Create a target called `all` that depends on the expanded list of files
```
LANGS = ['javascript', 'python']
KINDS = ['all', 'trimmed']

rule all:
    input:
        expand('figures/{lang}-counts-{kind}.svg', lang=LANGS, kind=KINDS)

rule count_all:
    input:
        'data/{lang}-counts.csv'
    output:
        'figures/{lang}-counts-all.svg'
    shell:
        'python bin/length-frequency-plot.py --data {input} --fig {output} --logx'

rule count_trimmed:
    input:
        'data/{lang}-counts.csv'
    output:
        'figures/{lang}-counts-trimmed.svg'
    shell:
        'python bin/length-frequency-plot.py --data {input} --fig {output} --low 2 --high 200'
```
{: title="Snakefile"}
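These notes don't show the plotting script itself; the sketch below is one way `bin/length-frequency-plot.py` could be written so that it matches the flags used in the `Snakefile` (`--data`, `--fig`, `--logx`, `--low`, `--high`). Everything beyond those flags and the `Length`/`Count` column names is an assumption, not the original implementation.

```python
#!/usr/bin/env python

import argparse
import pandas as pd
import matplotlib.pyplot as plt


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, help='counts CSV to plot')
    parser.add_argument('--fig', type=str, help='output figure file')
    parser.add_argument('--logx', action='store_true', help='log-scale the X axis')
    parser.add_argument('--low', type=int, help='lowest line length to keep')
    parser.add_argument('--high', type=int, help='highest line length to keep')
    args = parser.parse_args()

    data = pd.read_csv(args.data)
    if args.low is not None:
        data = data[data.Length >= args.low]
    if args.high is not None:
        data = data[data.Length <= args.high]

    # Total number of lines of each length across all files.
    freq = data.groupby('Length').Count.sum().reset_index()

    fig, ax = plt.subplots()
    ax.scatter(freq.Length, freq.Count, s=4)
    ax.set_yscale('log')
    if args.logx:
        ax.set_xscale('log')
    ax.set_xlabel('Line length')
    ax.set_ylabel('Count')
    fig.savefig(args.fig)


if __name__ == '__main__':
    main()
```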
- Create log-log plots of the frequency of line lengths in JavaScript and Python
{% include figure id="javascript-counts-all" img="figures/javascript-counts-all.svg" cap="Frequency of Line Lengths (JavaScript, All)" alt="FIXME" title="Log-log scatter plot with a point at Y equals 200,000 and X equals 1 and then a noticeable decline for X above 80." %}
{% include figure id="python-counts-all" img="figures/python-counts-all.svg" cap="Frequency of Line Lengths (Python, All)" alt="FIXME" title="Log-log scatter plot with a point at Y equals 2,000,000 at X equals 1 and then a sharp decline for X above 80." %}
- Exclude lines of length 1
  - These only contain the newline character, i.e., they're empty
- Exclude lines longer than 200 characters
  - Possibly a character encoding issue
  - Or minification of JavaScript
{% include figure id="javascript-counts-trimmed" img="figures/javascript-counts-trimmed.svg" cap="Frequency of Line Lengths (JavaScript, Trimmed)" alt="FIXME" title="Log-linear plot showing Y approximately 20,000 for X up to 50 and steady decline thereafter." %}
{% include figure id="python-counts-trimmed" img="figures/python-counts-trimmed.svg" cap="Frequency of Line Lengths (Python, Trimmed)" alt="FIXME" title="Log-linear plot showing Y approximately 20,000 for X below 10, Y between 100,000 and 200,000 for X up to 80, and a very sharp decline thereafter." %}
- These curves look different, but are they?
How can we tell if two populations are different?
- Null hypothesis: there is no significant difference between these curves
- Alternative hypothesis: there is a significant difference between these curves
- The probability of getting data like ours if the null hypothesis is true is called the \(p\) value
- Reject the null hypothesis if that probability is less than a chosen threshold
- Typically use a threshold of 0.05, i.e., less than 1 chance in 20 of getting the data if there isn't actually a difference
- The lower the \(p\) value, the higher our confidence
Kick the ball, then move the goal
\(p\) values can be misused in several ways. The most obvious is to choose a \(p\) value threshold after the fact in order to get a significant result: if you ever see reports that mix several different \(p\) values or use odd numbers like 0.073, this is probably what's going on.

The second form of abuse, called \(p\) hacking, is to re-analyze the data over and over until a "significant" result emerges. Consider: if the odds of getting a false positive from one analysis are 0.05, then the odds of getting a true negative are 0.95. The odds of getting two true negatives in a row are therefore \(0.95^2\), which is 0.9025, and if we keep going, the odds of none of our analyses crossing the threshold drop to roughly 50/50 once we do 14 analyses (\(0.95^{14} \approx 0.49\)). One sign that people are \(p\) hacking is that they find niche results like, "This treatment was effective for left-handed non-smokers between the ages of 45 and 55." The best way to safeguard against \(p\) hacking is to pre-register studies, i.e., to declare before collecting data what analyses are going to be done and how.
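A few lines of Python reproduce the arithmetic in that callout, assuming the analyses are independent:

```python
# Probability of at least one false positive after k independent analyses,
# each with a 5% false-positive rate.
for k in (1, 2, 5, 14):
    print(k, round(1 - 0.95 ** k, 3))
# 1 0.05
# 2 0.098
# 5 0.226
# 14 0.512
```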
- But how do we measure the likelihood?
- One approach is to use simulation (a permutation test):
  - Combine the two datasets
  - Split the combined data into two pieces at random (where each piece is the same size as one of the originals)
  - See how far apart the means of the two pieces are
  - Repeat a few thousand times
  - See how often we get a difference as large as the one we actually observed
  - Decide whether we believe the observed difference could plausibly have arisen by chance
```python
#!/usr/bin/env python

import sys
import argparse
import numpy as np
import pandas as pd


def main():
    '''
    Sample two datasets to calculate odds of means being distant.
    '''
    # ...parse arguments...
    # ...read data and calculate actual means and difference...
    # ...repeatedly sample and calculate difference...
    # ...report...
```
{: title="bin/simulate.py"}
- Arguments are pretty simple:
  - Two files
  - Cut the low and high values as we did for plotting
  - Number of trials
  - Seed for random number generator so we can reproduce results
```python
# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--left', type=str, help='first dataset')
parser.add_argument('--right', type=str, help='second dataset')
parser.add_argument('--low', type=int, help='lower limit')
parser.add_argument('--high', type=int, help='upper limit')
parser.add_argument('--trials', type=int, help='number of trials (>0)')
parser.add_argument('--seed', type=int, help='RNG seed')
args = parser.parse_args()
np.random.seed(args.seed)
```
{: title="bin/simulate.py"}
- Read the data and calculate actual statistics
```python
# read data and calculate actual means and difference
data_left = read_data(args.left, args.low, args.high)
data_right = read_data(args.right, args.low, args.high)
mean_left = data_left.mean()
mean_right = data_right.mean()
actual_diff = mean_left - mean_right
```
{: title="bin/simulate.py"}
- And the helper that reads and filters the data:
```python
def read_data(filename, low, high):
    '''
    Read data and strip lines shorter than `low` or longer than `high`.
    '''
    data = pd.read_csv(filename)
    if high is not None:
        data = data[data.Length <= high]
    if low is not None:
        data = data[data.Length >= low]
    return data.Length.repeat(repeats=data.Count)
```
{: title="bin/simulate.py"}
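The last line of `read_data` turns the summarized (Length, Count) rows back into one observation per line of code, which is what the simulation needs; a made-up miniature shows the effect:

```python
import pandas as pd

# Miniature stand-in for a *-counts.csv file: two lines of length 10
# and three lines of length 20.
data = pd.DataFrame({'FileID': [0, 0], 'Length': [10, 20], 'Count': [2, 3]})
expanded = data.Length.repeat(repeats=data.Count)
print(list(expanded))    # [10, 10, 20, 20, 20]
print(expanded.mean())   # 16.0
```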
- For each simulation:
  - Shuffle the index
  - Cut the data in two
  - Calculate means for the two pieces
  - Record the difference in the means
- Afterward:
  - Calculate the fraction of differences that are less than or equal to the actual difference
  - For comparison, also report the average difference in the means
```python
# repeatedly sample and calculate difference
combined = data_left.append(data_right).reset_index(drop=True)
split = len(data_left)
sample_diffs = []
for t in range(args.trials):
    shuffle = np.random.permutation(len(combined))
    sample_left = combined[shuffle[:split]]
    sample_right = combined[shuffle[split:]]
    sample_diffs.append(sample_left.mean() - sample_right.mean())
sample_diff_mean = sum(sample_diffs) / args.trials
success_frac = sum([x <= actual_diff for x in sample_diffs]) / args.trials
```
{: title="bin/simulate.py"}
- Write results as YAML
  - A long two-column CSV with (title, value) pairs would work
  - As would a wide CSV with one row of titles and one row of values
  - Either way, we must write the parameter values to ensure reproducibility
```python
# report
print(f'parameters:')
print(f'- left: "{args.left}"')
print(f'- right: "{args.right}"')
print(f'- low: {"null" if (args.low is None) else args.low}')
print(f'- high: {"null" if (args.high is None) else args.high}')
print(f'- trials: {args.trials}')
print(f'- seed: {args.seed}')
print(f'results:')
print(f'- mean_left: {mean_left}')
print(f'- mean_right: {mean_right}')
print(f'- actual_diff: {actual_diff}')
print(f'- sample_diff: {sample_diff_mean}')
print(f'- successes: {success_frac}')
```
{: title="bin/simulate.py"}
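Because the report is valid YAML, downstream scripts can load it instead of re-parsing text by hand. The sketch below (the file name and the use of PyYAML are assumptions) shows one way to read it back; note that each `- key: value` line parses as a single-entry mapping inside a list:

```python
import yaml

# Assumes the report was redirected to a file, e.g.
#   python bin/simulate.py ... > results/python-vs-js.yml
with open('results/python-vs-js.yml', 'r') as reader:
    report = yaml.safe_load(reader)

# Flatten the lists of single-entry mappings into ordinary dictionaries.
params = {k: v for entry in report['parameters'] for (k, v) in entry.items()}
results = {k: v for entry in report['results'] for (k, v) in entry.items()}
print(params['seed'], results['actual_diff'])
```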
How can we test our approach?
- We're showing completed working code, but we didn't get it right the first time
- Create two tiny "populations" and see if we get a plausible answer
  - `[1, 1]` vs. `[10, 10]`
```sh
python bin/simulate.py --left test/test-a.csv --right test/test-b.csv --trials 10000 --seed 1234567
```
```text
parameters:
- left: "test/test-a.csv"
- right: "test/test-b.csv"
- low: null
- high: null
- trials: 10000
- seed: 1234567
results:
- mean_left: 1.0
- mean_right: 10.0
- actual_diff: -9.0
- sample_diff: -0.0126
- successes: 0.1661
```
- Is that right?
  - 8 of the 24 permutations of the four values are a "perfect split"
  - But only half of those put the low values in the first piece
  - So we expect 4/24 successes, which is 0.1666…, and we actually got 0.1661 (a brute-force check follows below)
  - Our simulation also reports an average sampled difference of -0.0126, close to the expected 0.0
  - Seems to be doing what we want
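The expected value of 4/24 can be checked by brute force over all orderings of the four values:

```python
from itertools import permutations

# Enumerate all 24 orderings of the combined test data and count how often
# the first half's mean is at least 9 less than the second half's.
values = [1, 1, 10, 10]
hits = total = 0
for ordering in permutations(values):
    left, right = ordering[:2], ordering[2:]
    if sum(left) / 2 - sum(right) / 2 <= -9.0:
        hits += 1
    total += 1
print(hits, total, hits / total)   # 4 24 0.16666666666666666
```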
- We should test with unequal counts and unequal numbers of samples
  - Left as an exercise
- And note that this doesn't test whether the means are different, only whether the first is less than the second
  - A one-sided test
  - If this says "yes" and we reverse the order of the datasets, the answer will be "no"
  - Converting this into a two-sided test is one of the exercises
```sh
python bin/simulate.py --left data/javascript-counts.csv --right data/python-counts.csv --trials 5000 --seed 57622
```
```text
parameters:
- left: "data/javascript-counts.csv"
- right: "data/python-counts.csv"
- low: null
- high: null
- trials: 5000
- seed: 57622
results:
- mean_left: 38.168300030969725
- mean_right: 36.41329390723824
- actual_diff: 1.7550061237314836
- sample_diff: 0.00398256511711528
- successes: 1.0
```
- None of the 5000 random trials produced a difference as big as the one we actually observe
- So it seems pretty certain that the difference is more than just chance
- But it takes about two hours to run 5000 simulations on our real data
- We should add a `--verbose` flag to report progress
  - Or print the statistics every N iterations so we can check convergence interactively and decide when to cut things off
- And we should find a better way to answer the question
  - One that takes less time
  - And that tells us how confident we can be in the answer
  - Because right now 5000 is a magic number
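One possible speed-up, assuming (without profiling) that most of the time goes into fancy-indexing a large pandas Series on every trial, is to convert the combined data to a NumPy array once and shuffle that in place; this sketch reuses `combined`, `split`, and `args` from `bin/simulate.py` above:

```python
# Sketch: shuffle a NumPy array in place instead of permuting a pandas index.
values = combined.to_numpy(dtype=float)
rng = np.random.default_rng(args.seed)
sample_diffs = []
for t in range(args.trials):
    rng.shuffle(values)
    sample_diffs.append(values[:split].mean() - values[split:].mean())
```

Whether this is enough of an improvement still has to be measured, and it doesn't answer the deeper question of how many trials are enough.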