The Law of Large Numbers
- Problem: how can we do hypothesis testing
- More quickly (five hours of simulation to answer one question is a lot)
- And more confidently (is 5000 simulations enough? Would 100 work? Do we need a million?)
- Solution: use statistics
- Make some very general assumptions about our data
- Calculate an answer based on rules that hold for large datasets
What is the law of large numbers?
- Function describing probabilities of discrete events is called the probability mass function
- When describing continuous events, use:
- Cumulative distribution function \(F(x) = P(X \leq x)\)
- Probability density function \(f(x) = dF/dx\)
- So \(P(a \lt X \lt b) = \int_{a}^{b} f(x) dx\)
- Require \(\int_{-\infty}^{\infty} f(x) dx = 1\)
- I.e., something has to happen
- And notice \(P(X = x) = P(x \leq X \leq x) = \int_{x}^{x} f(x) dx = 0\)
- I.e., probability of any specific exact value is 0
- So always talk about ranges
- Mean is \(\mu = \int_{-\infty}^{\infty} x f(x) dx\)
- Variance is \(\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx = \int_{-\infty}^{\infty} x^2 f(x) dx - \mu^2\)
- Normally use standard deviation \(\sigma\) because it has the same units as the data
- Saves us from trying to figure out what a squared price is...
- Example: the uniform distribution has equal probability over a finite range \([a \ldots b]\) (checked numerically in the sketch after this list)
- \(f(x) = \frac{1}{b - a}\) for \(a \leq x \leq b\) (and 0 elsewhere)
- \(P(t \leq X \leq t+h) = \frac{h}{b - a}\) whenever \(a \leq t\) and \(t + h \leq b\)
- I.e., probability is proportional to fraction of range
- Standard uniform distribution has range \([0 \ldots 1]\)
- \(\mu = \frac{1}{2}\)
- \(\sigma^2 = \int_{0}^1 x^2 dx - (\frac{1}{2})^2 = \frac{1}{12}\)
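- A quick numerical check of these formulas (a minimal sketch, assuming NumPy is available):
import numpy as np

# Draw a large sample from the standard uniform distribution on [0, 1].
rng = np.random.default_rng(12345)
sample = rng.uniform(0.0, 1.0, size=100_000)

# The sample mean and variance should be close to 1/2 and 1/12.
print("mean:", sample.mean())       # expect ~0.5
print("variance:", sample.var())    # expect ~1/12 = 0.0833...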
What is the normal distribution and why do we care?
- In its full glory, normal distribution has
\( f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}} \)
- There is no closed-form formula for the integral \(F(x)\)
- But as the notation suggests, mean is \(\mu\) and variance is \(\sigma^2\)
- The standard normal distribution \(Z\) has mean \(\mu = 0\) and standard deviation \(\sigma = 1\)
- Easy to move back and forth between this and an arbitrary normal distribution via \(X = \mu + \sigma Z\)
- Let \(S_n = X_1 + X_2 + \ldots + X_n\) be the sum of \(n\) independent random variables, all with mean \(\mu\) and standard deviation \(\sigma\)
- Can be drawn from (almost) any distribution
- As \(n \rightarrow \infty\), \(\frac{S_n - n\mu}{\sigma \sqrt{n}}\) converges on a standard normal random variable (this is the central limit theorem)
- I.e., the distribution of our estimates of the mean is normal regardless of the underlying distribution
- Rate of convergence is \(\frac{1}{\sqrt{n}}\)
- I.e., to double the precision, quadruple the sample size
- Heuristic: for \(n \gt 30\), \(S_n\) is approximately normally distributed
- Sample mean \(\bar{X}\) estimates the population mean
- Variance of \(\bar{X}\) is \(\frac{\sigma^2}{n}\)
- Distribution of sample means is normal, i.e. \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) is standard normal as \(n \rightarrow \infty\)
- Regardless of the underlying distribution of \(X\)
- FIXME: add program to sample various uniform distributions and see how the distribution of sample means converges on a normal distribution (sketched below)
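- A minimal sketch of such a program (assuming NumPy; it prints how close the standardized sample means are to a standard normal instead of plotting):
import numpy as np

rng = np.random.default_rng(67890)

# Population: standard uniform, so mu = 1/2 and sigma = sqrt(1/12).
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)

for n in (2, 5, 30, 100):
    # Draw 10,000 samples of size n and compute each sample's mean.
    means = rng.uniform(0.0, 1.0, size=(10_000, n)).mean(axis=1)
    # Standardize: (X-bar - mu) / (sigma / sqrt(n)) should behave like Z.
    z = (means - mu) / (sigma / np.sqrt(n))
    # For a standard normal, about 68.3% of values lie within +/- 1.
    print(f"n = {n:4d}: fraction within one sigma = {np.mean(np.abs(z) <= 1.0):.3f}")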
How can we use this to quantify confidence?
- A confidence interval is an interval \([a \ldots b]\)
that has some probability \(p\) of containing the actual value of a statistic
- E.g., "There is a 90% probability that the actual mean of this population lies between 2.5 and 3.5"
- Larger intervals are less precise but have a higher probability
- If there are more than 30 samples or the standard deviation \(\sigma\) is known, use a Z-test:
- Choose a confidence level \(C\) (typically 95%)
- Find the value \(z^{\star}\) such that \(P(Z \gt z^{\star}) = \frac{1 - C}{2}\) in a standard normal distribution
- Divide by 2 because the normal curve has two symmetric tails
- Calculate the sample mean \(\bar{X}\)
- Interval is \(\bar{X} \pm z^{\star}\frac{\sigma}{\sqrt{n}}\)
{% include figure id="two-tailed-test" cap="Two-Tailed Significance Test" fixme=true alt="FIXME" title="Normal curve overlaid on grid. Symmetric segments in the low and high ends of the normal curve are highlighted to show regions more than a certain distance from the cente fixme=true ." width="50%" credit="'Boundless Statistics', Lumen Learning, https://courses.lumenlearning.com/boundless-statistics/chapter/hypothesis-testing-one-sample/" %}
- FIXME: example
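- A sketch of those steps in code (the data values and \(\sigma\) here are made up for illustration; scipy.stats.norm supplies \(z^{\star}\)):
import numpy as np
from scipy.stats import norm

# Hypothetical sample; the population standard deviation is assumed known.
data = np.array([2.9, 3.1, 3.4, 2.7, 3.8, 3.2, 3.0, 3.5])
sigma = 0.5
confidence = 0.95

x_bar = data.mean()
n = len(data)

# Critical value z* such that P(Z > z*) = (1 - C) / 2.
z_star = norm.ppf(1 - (1 - confidence) / 2)

half_width = z_star * sigma / np.sqrt(n)
print(f"{confidence:.0%} CI: {x_bar - half_width:.3f} .. {x_bar + half_width:.3f}")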
Student's t-distribution
- Usually don't know the distribution's variance
- The sample variance is:
\( \begin{align} s^2 & = \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ & = \frac{\sum X_i^2 - n\bar{X}^2}{n - 1} \end{align} \)
- Using \(n-1\) instead of \(n\) ensures that \(s^2\) is unbiased (Bessel's correction)
- See proof
- Student's t-distribution is used to estimate the mean of a normally distributed population
when the sample size is small (e.g., less than 30) and the variance is unknown
- The name comes from a pseudonym used by the mathematician who first used it this way
- The variable \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) has a standard normal distribution
- However, the variable \(\frac{\bar{X} - \mu}{s / \sqrt{n}}\) has a t-distribution
with \(n-1\) degrees of freedom
- Called degrees of freedom because once \(n-1\) values are known, the value of the \(n^{th}\) is fixed
- \(n-1\) because the deviations \(X_i - \bar{X}\) always sum to zero, so only \(n-1\) of them can vary freely
- The exact formula for the t-distribution is a little bit scary.
- The PDF's shape resembles that of a normal distribution with mean 0 and variance 1, but is slightly lower and wider.
- The two become closer as the degrees of freedom \(\nu\) gets larger.
- A t-test follows the same steps as a Z-test:
- Choose a confidence level \(C\)
- Find a value \(t^{\star}\) such that \(P(T \gt t^{\star}) = \frac{1 - C}{2}\) in a Student's t-distribution with \(n-1\) degrees of freedom
- Estimate the standard deviation \(s\)
- Interval is \(\bar{X} \pm t^{\star}\frac{s}{\sqrt{n}}\)
- FIXME: example
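- A sketch of those steps in code (same made-up data as the Z-test sketch, but the standard deviation is now estimated from the sample):
import numpy as np
from scipy.stats import t

data = np.array([2.9, 3.1, 3.4, 2.7, 3.8, 3.2, 3.0, 3.5])
confidence = 0.95

x_bar = data.mean()
s = data.std(ddof=1)    # ddof=1 applies Bessel's correction
n = len(data)

# Critical value t* such that P(T > t*) = (1 - C) / 2 with n-1 degrees of freedom.
t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)

half_width = t_star * s / np.sqrt(n)
print(f"{confidence:.0%} CI: {x_bar - half_width:.3f} .. {x_bar + half_width:.3f}")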
How can we compare the means of two datasets?
- What is the probability of seeing this difference between two datasets?
- The null hypothesis \(H_0\) is that the samples come from a single population and the observed difference is purely due to chance
- The alternative hypothesis \(H_A\) is that the samples come from two different populations
- False positive: decide that the difference is not purely random when it is
- False negative: decide the difference is purely random when it isn't
- Also called Type I and Type II errors (but see https://twitter.com/neilccbrown/status/1202595479890124801)
- Adapt the simulation program (keep a subset of the command-line parameters)
from scipy.stats import ttest_ind

def main():
    # ...parse arguments...
    # ...read data and calculate actual means and difference...
    # test and report
    result = ttest_ind(data_left, data_right)
    print(result)
- Run
python bin/t-test.py --left ../hypothesis-testing/data/javascript-counts.csv --right ../hypothesis-testing/data/python-counts.csv --low 1 --high 200
Ttest_indResult(statistic=-269.67014904687954, pvalue=0.0)
- The \(p\) value is so small that the computer can't distinguish it from zero
- Which means the chances of getting this difference by randomly splitting a single population are vanishingly small
- Look at the hours worked per day in 2019
- Data is (date, hours) pairs taken from a spreadsheet
- There are a lot of spreadsheets in data science
- Split into weekday and weekend subsets and visualize
- Note that hours are never actually negative, but the smoothed violin curve extends below zero
{% include figure id="programmer-hours" cap="Programmer Hours (Weekday vs. Weekend)" alt="FIXME" title="A pair of vertical violin plots. The mean for weekday equals false is near 2.1 hours per day and the mean for weekday equals true is slightly above 7 hours per day. The profile for weekday equals false does not look normal, but the profile for weekday equals true looks more normal." fixme=true %}
- They certainly seem different
- And a t-test confirms it
- The \(p\) value is large enough this time to be printable...
python bin/weekends.py --data data/programmer-hours.csv
weekday mean 6.804375000071998
weekend mean 3.232482993312492
Ttest_indResult(statistic=12.815512046971827, pvalue=6.936182610195961e-31)
Higher standards
- Recall the discussion of \(p\) hacking
- If we analyze the data enough different ways, one of them will be "significant"
- Use the Bonferroni correction
- The more tests we do, the more stringent our significance criteria must be (see the sketch below)
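- A minimal sketch of the correction (the \(p\) values are made up for illustration): with \(m\) tests and an overall significance level \(\alpha\), compare each test's \(p\) value against \(\alpha / m\)
# Hypothetical p values from several different tests on the same data.
p_values = [0.04, 0.008, 0.03, 0.20]
alpha = 0.05

# Bonferroni: each individual test must clear alpha / m.
threshold = alpha / len(p_values)
for i, p in enumerate(p_values):
    verdict = "significant" if p < threshold else "not significant"
    print(f"test {i}: p = {p:.3f} -> {verdict} at threshold {threshold:.4f}")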