Theory
A little theory can go a long way.
Basic probability
- Assume a set of events \(E\) and a probability function \(P\) with \(0 \leq P(x) \leq 1\)
- \(P(E) = 1\) (i.e., something must happen)
- If \(A\) and \(B\) are disjoint sets of events from \(E\), then \(P(A \cup B) = P(A) + P(B)\) (i.e., the probability of the union of disjoint sets is the sum of their probabilities)
- A few consequences:
- If \(A\) and \(B\) are not disjoint, then \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
- \(P(\neg A) = 1 - P(A)\)
- If \(A\) and \(B\) are independent, \(P(A \cap B) = P(A)P(B)\)
- The probability of the union of \(n\) equally likely disjoint events \(x_i\) is \(nP(x_i)\)
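These rules are easy to check empirically. Here is a minimal Python sketch (the die and the two events are made-up examples) verifying the rule for overlapping events:

```python
import random

# Monte Carlo check of P(A or B) = P(A) + P(B) - P(A and B)
# for one roll of a fair six-sided die.
TRIALS = 100_000
a = b = both = union = 0
for _ in range(TRIALS):
    roll = random.randint(1, 6)
    in_a = roll <= 4           # A: roll is 1-4, P(A) = 4/6
    in_b = roll % 2 == 0       # B: roll is even, P(B) = 3/6
    a += in_a
    b += in_b
    both += in_a and in_b
    union += in_a or in_b
print(union / TRIALS)             # ~5/6
print((a + b - both) / TRIALS)    # same value
```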
Combinatorics
- There are \(n^{k}\) ways to sample \(k\) distinguishable items from \(n\) with replacement
- There are \(P(n, k) = \frac{n!}{(n - k)!}\) ways to sample \(k\) distinguishable items from \(n\) without replacement
- If the order of selection doesn't matter (i.e., the samples are indistinguishable), the number of selections is
\(C(n, k) = \binom{n}{k} = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}\)
- Divide by \(k!\) because each selection of \(k\) items can be arranged that many ways
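Python's standard library implements all three counts directly; a small sketch with assumed values \(n = 10\), \(k = 3\):

```python
import math

n, k = 10, 3
print(n ** k)           # with replacement: 1000
print(math.perm(n, k))  # ordered, without replacement: n!/(n-k)! = 720
print(math.comb(n, k))  # unordered, without replacement: 120
print(math.perm(n, k) // math.factorial(k))  # divide by k!: also 120
```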
Conditional probability
- The conditional probability \(P(A \mid B)\) of \(A\) given \(B\) is the probability of \(A\) occurring given that \(B\) is known to occur
- \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)
Bayes Rule
- Since \(P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)\), we have Bayes' Rule \(P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}\)
- Special case: since \(P(A) = P(A \mid B)P(B) + P(A \mid \neg B)P(\neg B)\), \(P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A \mid B)P(B) + P(A \mid \neg B)P(\neg B)}\)
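A worked example of the special case with made-up numbers: a defect detector that is 99% sensitive and 95% specific, applied where 2% of items are actually defective:

```python
# B: item is defective; A: detector reports a defect.
p_a_given_b = 0.99        # P(A | B), hypothetical sensitivity
p_a_given_not_b = 0.05    # P(A | not B), hypothetical false-positive rate
p_b = 0.02                # P(B), hypothetical defect rate

# Denominator is the total-probability expansion of P(A) shown above.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # ~0.29: most reported defects are false alarms
```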
Mean and variance
- The mean \(\mu\) or expected value \(E(X)\) is the weighted sum of possible outcomes \(\sum_{x} xP(x)\)
- The definition implies linearity: \(E(aX + bY + c) = aE(X) + bE(Y) + c\)
- The variance \(\sigma^2\) is the expected value of the square of the difference between values and the mean
\(\sum_{x} (x - \mu)^2P(x)\)
- Squaring guarantees that values are positive
- And gives extra weight to outliers
- But the units are weird: "bugs squared"
- \(\sigma^{2}_{A + B} = \sigma^2_A + \sigma^2_B\) if \(A\) and \(B\) are independent
- The standard deviation \(\sigma\) is the square root of the variance
- Same units as original variable
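Applying the definitions to one roll of a fair die:

```python
# Mean, variance, and standard deviation from the definitions above.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6
mu = sum(x * p for x in outcomes)               # 3.5
var = sum((x - mu) ** 2 * p for x in outcomes)  # 35/12 ~ 2.917
sigma = var ** 0.5                              # ~1.708, same units as the roll
print(mu, var, sigma)
```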
Covariance and correlation
- Covariance \(\sigma_{XY}\) of \(X\) and \(Y\) is \(E((X - \mu_{X})(Y - \mu_{Y}))\)
- If \(X\) and \(Y\) are both above or below their means at the same time, \(\sigma_{XY}\) will be positive
- If \(X\) is above when \(Y\) is below and vice versa, \(\sigma_{XY}\) will be negative
- If there is no relation, \(\sigma_{XY}\) will be zero
- Pearson's correlation coefficient \(r_{XY}\) is covariance normalized by standard deviations
- \(r_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\)
- Always lies in \([-1 \ldots 1]\)
- Chebyshev's Inequality: \(P(\mid X - \mu \mid \gt \epsilon) \leq (\frac{\sigma}{\epsilon})^2\)
- I.e., the probability of a value being more than \(\epsilon\) away from the mean is bounded by the square of the ratio between the standard deviation and \(\epsilon\)
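A sketch of Pearson's \(r\) computed directly from the definitions (the data is made up for illustration; `statistics.correlation` needs Python 3.10 or later):

```python
import statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
mx, my = statistics.mean(xs), statistics.mean(ys)
# Sample covariance, then normalize by the standard deviations.
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
r = cov / (statistics.stdev(xs) * statistics.stdev(ys))
print(r)                               # ~0.999: strong positive relation
print(statistics.correlation(xs, ys))  # library version, same value
```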
Bernoulli distribution
- A Bernoulli distribution describes a random variable with just two values, 1 and 0 (sometimes called success and failure)
- Named after Jacob Bernoulli, the mathematician who first described it
- Probability of success is \(p\)
- \(\mu = 0(1 - p) + 1(p) = p\)
- \(\sigma^2 = \sum (x - p)^2 P(x) = (0 - p)^2 (1 - p) + (1 - p)^2 p = p(1 - p)\)
Binomial distribution
- A binomial distribution describes the number of successes in \(n\) trials of a Bernoulli variable with probability \(p\)
- Name means "two numbers" (referring to \(n\) and \(x\))
- Probability of exactly \(x\) successes in \(n\) trials is \(\binom{n}{x} p^x (1-p)^{n-x}\)
- Number of different arrangements of that many successes and that many failures
- \(\mu = np\) (by linearity of expectation)
- \(\sigma^2 = np(1-p)\) (because \(\sigma^{2}_{A + B} = \sigma^2_A + \sigma^2_B\) if \(A\) and \(B\) are independent)
- Probability of up to \(x\) successes is tedious to calculate directly, but the normal approximation described under the Central Limit Theorem works well for large \(n\)
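A simulation sketch checking the binomial mean and variance formulas with assumed values \(n = 20\), \(p = 0.3\):

```python
import random
import statistics

n, p = 20, 0.3
# Each sample is the number of successes in n Bernoulli trials.
samples = [sum(random.random() < p for _ in range(n))
           for _ in range(50_000)]
print(statistics.mean(samples), n * p)                # both ~6.0
print(statistics.variance(samples), n * p * (1 - p))  # both ~4.2
```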
Geometric distribution
- A geometric distribution describes the number of Bernoulli trials needed to get the first success
- Potentially unbounded: there is no limit on how many trials the first success may take
- \(P(x) = (1 - p)^{x - 1}p\) (i.e., \(x-1\) failures followed by 1 success)
- To find the mean, use the fact that \(\sum_{i=0}^{\infty} x^i = \frac{1}{1 - x}\) for \(0 \lt x \lt 1\)
- A geometric series, which gives the distribution its name
- \(\mu = 1/p\)
- The variance takes a little more work, but is \(\sigma^2 = \frac{1-p}{p^2}\)
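The same kind of simulation check for the geometric distribution, with an assumed \(p = 0.25\):

```python
import random
import statistics

p = 0.25

def trials_to_first_success():
    count = 1
    while random.random() >= p:  # failure: keep trying
        count += 1
    return count

samples = [trials_to_first_success() for _ in range(50_000)]
print(statistics.mean(samples), 1 / p)                 # both ~4.0
print(statistics.variance(samples), (1 - p) / p ** 2)  # both ~12.0
```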
Negative binomial distribution
- The binomial distribution describes the number of successes in a fixed number of trials
- The negative binomial distribution describes the number of trials required to achieve a certain number of successes
- "Negative" in the sense of opposite: there is nothing negative in the values
- \(P(x) = \binom{x - 1}{k - 1} (1 - p)^{x - k} p^k\)
- Reading right to left, this is \(k\) successes, \(x-k\) failures, and the number of possible arrangements of them, where the \(-1\)s reflect the fact that the last trial must be a success
- \(\mu = \frac{k}{p}\)
- \(\sigma^2 = \frac{k(1-p)}{p^2}\)
- With \(k=1\), the negative binomial is just the geometric distribution (number of trials to first success)
Poisson distribution
- Number of events occurring within a fixed period has a Poisson distribution
- Assuming events never occur simultaneously
- \(P(x) = e^{- \lambda}\frac{\lambda^x}{x!}\)
- \(\mu = \lambda\)
- \(\sigma^2 = \lambda\)
- Poisson is a limiting case of the binomial in which the number of trials is very large and the probability of success in any trial is small (see the proof at the end of this section)
- For \(n \geq 30\) and \(p \leq 0.05\), Poisson is a good approximation of binomial
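A sketch comparing the two PMFs for an assumed \(n = 100\), \(p = 0.02\) (so \(\lambda = np = 2\)):

```python
import math

n, p = 100, 0.02
lam = n * p
for k in range(6):
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    print(k, round(binom, 4), round(poisson, 4))  # agree to ~2 digits
```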
Probability density and cumulative distribution
- Function describing probabilities of discrete events is called the probability mass function
- When describing continuous events, use:
- Cumulative distribution function \(F(x) = P(X \leq x)\)
- Probability density function \(f(x) = dF/dx\)
- So \(P(a \lt X \lt b) = \int_{a}^{b} f(x) dx\)
- Require \(\int_{-\infty}^{\infty} f(x) dx = 1\)
- And notice \(P(x) = P(x \leq X \leq x) = \int_{x}^{x} f(x) dx = 0\)
- I.e., probability of any specific exact value is 0
- \(\mu = \int_{-\infty}^{\infty} x f(x) dx\)
- \(\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx = \int_{-\infty}^{\infty} x^2 f(x) dx - \mu^2\)
Uniform distribution
- Uniform distribution has equal probability over a finite range \([a \ldots b]\)
- \(f(x) = \frac{1}{b - a}\)
- \(P(t \leq X \leq t+h) = \frac{h}{b - a}\) for \(a \leq t \leq t+h \leq b\)
- I.e., probability is proportional to fraction of range
- Standard uniform distribution has range \([0 \ldots 1]\)
- \(\mu = \frac{1}{2}\)
- \(\sigma^2 = \int_{0}^1 x^2 dx - (\frac{1}{2})^2 = \frac{1}{12}\)
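A crude midpoint-rule integration confirming the standard uniform results from the definitions in the previous section:

```python
STEPS = 100_000
dx = 1.0 / STEPS
xs = [(i + 0.5) * dx for i in range(STEPS)]  # f(x) = 1 on [0, 1]
mu = sum(x * dx for x in xs)
var = sum((x - mu) ** 2 * dx for x in xs)
print(mu, var)  # ~0.5 and ~0.0833 (= 1/12)
```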
Exponential distribution
- Assume the waiting time for the next rare event is independent of the time waited so far
- I.e., \(P(X \geq b+a \mid X \geq a) = P(X \geq b)\)
- Can show that \(f(x) = \lambda e^{- \lambda x}\) is the unique solution, so \(F(x) = 1 - e^{- \lambda x}\)
- \(\mu = \frac{1}{\lambda}\)
- \(\sigma^2 = \frac{1}{\lambda^2}\)
- The parameter \(\lambda\) is the frequency
- If \(\lambda = 2\) then we expect 2 events per unit time and \(\mu = \frac{1}{\lambda} = 0.5\) units of time between events
- The number of events in time \(t\) has a Poisson distribution with parameter \(\lambda t\) (discrete)
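A simulation sketch of memorylessness, with assumed values \(\lambda = 2\), \(a = 0.5\), \(b = 0.3\):

```python
import random

lam, a, b = 2.0, 0.5, 0.3
samples = [random.expovariate(lam) for _ in range(200_000)]
# P(X >= a + b | X >= a) should equal P(X >= b).
survived_a = [x for x in samples if x >= a]
lhs = sum(x >= a + b for x in survived_a) / len(survived_a)
rhs = sum(x >= b for x in samples) / len(samples)
print(lhs, rhs)  # both ~exp(-lam * b) ~ 0.549
```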
Gamma distribution
- If there are \(\alpha\) independent sequential steps,
each taking an exponentially-distributed time with rate \(\lambda\),
then the total time has a Gamma distribution Gamma(\(\alpha\), \(\lambda\))
- Name comes from the gamma function, which extends factorial to complex numbers
- \(f(x) = \frac{\lambda^\alpha}{\Gamma (\alpha)} x^{\alpha - 1} e^{- \lambda x}\)
- \(\alpha\) is sometimes called the shape parameter
- Gamma(1, \(\lambda\)) is an exponential distribution with rate \(\lambda\)
- \(\mu = \frac{\alpha}{\lambda}\)
- \(\sigma^2 = \frac{\alpha}{\lambda^2}\)
- For specific values, use library or lookup tables
- For integer \(\alpha\), Gamma(\(\alpha\), \(\lambda\)) is the waiting time until the \(\alpha^{th}\) event in a Poisson process with rate \(\lambda\)
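A simulation sketch building Gamma(\(\alpha\), \(\lambda\)) from exponential steps, with assumed \(\alpha = 3\), \(\lambda = 2\):

```python
import random
import statistics

alpha, lam = 3, 2.0
# Each sample is the total time for alpha sequential exponential steps.
samples = [sum(random.expovariate(lam) for _ in range(alpha))
           for _ in range(100_000)]
print(statistics.mean(samples), alpha / lam)           # both ~1.5
print(statistics.variance(samples), alpha / lam ** 2)  # both ~0.75
```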
Normal distribution
- In its full glory, normal distribution
has \(f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}}\)
- There is no closed form for the integral \(F(x)\)
- But as the notation suggests, the mean is \(\mu\) and the variance is \(\sigma^2\)
- The standard normal distribution \(Z\) has mean \(\mu = 0\) and standard deviation \(\sigma = 1\)
- Reconstruct an arbitrary normal distribution with \(X = \mu + \sigma Z\)
Central Limit Theorem
- Let \(S_n = X_1 + X_2 + \ldots + X_n\) be the sum of \(n\) independent random variables, all with mean \(\mu\) and standard deviation \(\sigma\)
- As \(n \rightarrow \infty\), \(\frac{S_n - n\mu}{\sigma \sqrt{n}}\) converges on a standard normal random variable
- Rate of convergence is \(\frac{1}{\sqrt{n}}\)
- Heuristic: for \(n \gt 30\), a normal distribution is an accurate approximation to \(S_n\)
- In particular, binomial(\(n\), \(p\)) \(\approx\) normal(\(\mu = np\), \(\sigma = \sqrt{n p (1 - p)}\) )
- But use a continuity correction: adjust interval by 0.5 units to allow for difference between discrete and continuous
- E.g., \(P(X = x) = P(x - 0.5 \lt X \lt x + 0.5)\) (because the probability of a point in a continuous distribution is zero)
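A sketch of the normal approximation with the continuity correction, using assumed values \(n = 50\), \(p = 0.4\), \(x = 20\):

```python
import math
from statistics import NormalDist

n, p, x = 50, 0.4, 20
exact = math.comb(n, x) * p ** x * (1 - p) ** (n - x)
# Approximate P(X = x) as P(x - 0.5 < N < x + 0.5).
approx_dist = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
approx = approx_dist.cdf(x + 0.5) - approx_dist.cdf(x - 0.5)
print(exact, approx)  # both ~0.115
```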
Sampling
- A population has a parameter; a sample has a statistic
- Sample mean \(\bar{X}\) estimates the population mean \(\mu = E(X)\)
- Variance of \(\bar{X}\) is \(\frac{\sigma^2}{n}\)
- Distribution of sample means is normal, i.e. \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) is standard normal as \(n \rightarrow \infty\)
- Regardless of the underlying distribution of \(X\)
- The median is the central value such that \(P(X \lt M) \leq 0.5\) and \(P(X \gt M) \leq 0.5\)
- If the distribution is skewed to the right, \(M \lt \mu\)
- If the distribution is skewed to the left, \(M \gt \mu\)
- A quartile is a value that divides the sample into quarters
- E.g., first quartile \(Q_1\) splits values 25%/75%, third quartile \(Q_3\) splits values 75%/25%
- Second quartile is the same as the median
- The interquartile range IQR is \(Q_3 - Q_1\)
- Anything more than 1.5 IQR below \(Q_1\) or above \(Q_3\) is considered an outlier
- Because if the data is normal, only 0.7% of it should lie outside these bounds
- The sample variance is:
\( \begin{align} s^2 & = & \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ & = & \frac{\sum X_i^2 - n\bar{X}^2}{n - 1} \end{align} \)
- Using \(n-1\) instead of \(n\) ensures that \(s^2\) is unbiased (the Bessel correction)
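A simulation sketch of why \(n-1\) matters: with many small samples from a standard uniform population (true variance \(1/12\)), dividing by \(n\) systematically underestimates:

```python
import random
import statistics

POPULATION_VAR = 1 / 12  # standard uniform, from the section above
n, runs = 5, 50_000
biased = unbiased = 0.0
for _ in range(runs):
    sample = [random.random() for _ in range(n)]
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased += ss / n          # divides by n: too small on average
    unbiased += ss / (n - 1)  # Bessel-corrected
print(biased / runs, unbiased / runs, POPULATION_VAR)
# ~0.0667 vs ~0.0833 vs 0.0833
```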
Parameter estimation
- The \(k^{th}\) population moment is \(E(X^k)\)
- The \(k^{th}\) sample moment is \(\frac{1}{n}\sum_{i=1}^{n} X_i^k\)
- The first sample moment is the sample mean
- The population and sample central moments are \(E(X - \mu)^k\) and \(\frac{1}{n}\sum(X_i - \bar{X})^k\) respectively
- I.e., the moments after shifting the data to a mean of zero
- Parameter estimation via the method of moments: find parameter values to match sample moments
- Parameter estimation via maximum likelihood: choose parameter values to maximize the likelihood of the observed sample
- Example: estimating \(\lambda\) for a Poisson distribution
- PMF is \(P(x) = e^{- \lambda} \frac{\lambda^x}{x!}\)
- Taking logarithms, \(\ln{P(x)} = -\lambda + x \ln{\lambda} - \ln{x!}\)
- So we need to maximize the log-likelihood of the whole sample, \(\ln{L(\lambda)} = \sum{(-\lambda + X_i \ln{\lambda})} + C = -n \lambda + \ln{\lambda}\sum{X_i} + C\), where \(C = -\sum{\ln{X_i!}}\) is a constant that does not depend on \(\lambda\)
- Differentiating with respect to \(\lambda\) and setting equal to zero gives \(\frac{\partial}{\partial{\lambda}}\ln{L(\lambda)} = -n + \frac{1}{\lambda}\sum{X_i} = 0\)
- Only solution is \(\lambda = \frac{1}{n}\sum{X_i} = \bar{X}\)
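A numerical check that the sample mean recovers \(\lambda\); Poisson samples are generated here by counting exponential inter-arrival times within one unit of time (the relationship proved at the end of this section):

```python
import random
import statistics

lam = 4.0

def poisson_sample():
    # Count events whose cumulative exponential waiting times fit in [0, 1).
    count, t = 0, random.expovariate(lam)
    while t < 1.0:
        count += 1
        t += random.expovariate(lam)
    return count

data = [poisson_sample() for _ in range(20_000)]
print(statistics.mean(data))  # the MLE of lambda, ~4.0
```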
Student's t-distribution
- Student's t-distribution is used to estimate the mean of a normally distributed population
when the sample size is small (e.g., fewer than 30) and the variance is unknown
- The name comes from "Student", the pseudonym used by the mathematician (William Sealy Gosset) who first used it this way
- If \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), then \(\bar{X} = \frac{1}{n}\sum{X_i}\) is the sample mean and \(s^2 = \frac{1}{n-1}\sum{(X_i - \bar{X})^2}\) is the Bessel-corrected sample variance
- The variable \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) has a standard normal distribution
- However, the variable \(\frac{\bar{X} - \mu}{s / \sqrt{n}}\) has a t-distribution
with \(n-1\) degrees of freedom
- \(n-1\) because the deviations \(X_i - \bar{X}\) must sum to zero
- Once \(n-1\) of them are known, the value of the \(n^{th}\) is fixed
- The exact formula for the t-distribution is a little bit scary
- The PDF's shape resembles that of a normal distribution with mean 0 and variance 1, but is slightly lower and wider.
- The two become closer as the degrees of freedom \(\nu\) gets larger.
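A sketch comparing the two PDFs at a few points, using the exact t formula given in the proofs section:

```python
import math
from statistics import NormalDist

def t_pdf(t, nu):
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi)
                                        * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)

z = NormalDist()
for x in (0.0, 1.0, 2.0):
    # nu = 5 is visibly lower and wider than normal; nu = 30 is close.
    print(x, t_pdf(x, 5), t_pdf(x, 30), z.pdf(x))
```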
Confidence intervals
- A confidence interval is an interval \([a \ldots b]\)
that has some probability \(p\) of containing the actual value of a statistic
- E.g., "There is a 90% probability that the actual mean of this population lies between 2.5 and 3.5"
- Larger intervals have a higher probability but are less precise
- If there are more than 30 samples or the standard deviation \(\sigma\) is known, use a z-test:
- Choose a confidence level \(C\) (typically 95%)
- Find the value \(z^{\star}\) such that \(P(Z \gt z^{\star}) = \frac{1 - C}{2}\)
in a standard normal distribution
- Divide by 2 because the normal curve has two symmetric tails
- Calculate the sample mean \(\bar{X}\)
- Interval is \(\bar{X} \pm z^{\star}\frac{\sigma}{\sqrt{n}}\)
- If there are fewer than 30 samples or the standard deviation isn't known, use a t-test:
- Choose a confidence level \(C\)
- Find a value \(t^{\star}\) such that \(P(T \gt t^{\star}) = \frac{1 - C}{2}\) in a Student's t-distribution with \(n-1\) degrees of freedom
- Estimate the standard deviation \(s\)
- Interval is \(\bar{X} \pm t^{\star}\frac{s}{\sqrt{n}}\)
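A z-interval sketch with made-up data, pretending \(\sigma = 2\) is known; the t-interval is the same calculation with \(t^{\star}\) (available from, e.g., `scipy.stats.t.ppf`) and \(s\) in place of \(z^{\star}\) and \(\sigma\):

```python
import statistics
from statistics import NormalDist

data = [4.8, 5.2, 3.9, 6.1, 5.5, 4.4, 5.0, 5.8, 4.6, 5.3]
sigma, C = 2.0, 0.95
z_star = NormalDist().inv_cdf((1 + C) / 2)  # ~1.96 for 95%
mean = statistics.mean(data)
margin = z_star * sigma / len(data) ** 0.5
print(mean - margin, mean + margin)
```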
Hypothesis testing
- What is the probability of seeing this difference between two datasets?
- The null hypothesis \(H_0\) is that the samples come from a single population and the observed difference is purely due to chance
- The alternative hypothesis \(H_A\) is that the samples come from two different populations
- False positive: decide that the difference is not purely random when it is
- False negative: decide the difference is purely random when it isn't
- Example: if a coin comes up heads 9 times out of 10, how likely is that result if the coin is actually fair?
- Probability if the coin is fair is \(\binom{10}{9} \cdot 0.5^9 \cdot (1-0.5)^1 = 10 \cdot 0.00195 \cdot 0.5 = 0.00976\)
- I.e., less than 1% chance of seeing this result if the coin is fair
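The same calculation in code:

```python
import math

# Probability of exactly 9 heads in 10 flips of a fair coin.
print(math.comb(10, 9) * 0.5 ** 9 * 0.5 ** 1)  # ~0.0098
# Probability of a result at least this surprising (9 or 10 heads):
print(sum(math.comb(10, k) * 0.5 ** 10 for k in (9, 10)))  # ~0.0107
```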
Prediction
- Accuracy is the fraction of correct predictions, i.e., (true positives + true negatives) divided by all predictions
- Not useful if there are only a few defective items, since "all good" will have high accuracy
- Precision is the fraction of predicted positives that are actually positive, i.e., true positives / (true positives + false positives)
- Recall is the fraction of actual positives the method identifies, i.e., true positives / (true positives + false negatives)
- Perfect precision and perfect recall mean all items are classified correctly
- But increasing precision often reduces recall and vice versa
- The F-measure is the harmonic mean of precision and recall
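The measures computed from a hypothetical confusion matrix in which defective items are rare:

```python
tp, fp, fn, tn = 40, 10, 20, 930  # made-up counts
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.97, despite 20 missed defects
precision = tp / (tp + fp)                  # 0.80
recall = tp / (tp + fn)                     # ~0.67
f_measure = 2 * precision * recall / (precision + recall)  # ~0.73
print(accuracy, precision, recall, f_measure)
```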
Spearman's rank correlation
- Spearman's rank correlation measures the rank correlation between two variables
- Rather than the values, measure how well the sorted order of items matches
- Given paired observations \(X_i\) and \(Y_i\), sort each variable, assign ranks \(r_{X_i}\) and \(r_{Y_i}\), and calculate the correlation coefficient of the ranks
- If ranks are distinct integers, can use the simplified formula \(\rho = 1 - \frac{6 \sum{d_i^2}}{n (n^2 - 1)}\) where \(d_i = r_{X_i} - r_{Y_i}\) is the difference between the ranks of observation \(i\)
- Standard error is \(\sigma = \frac{0.6325}{\sqrt{n - 1}}\)
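A sketch of the simplified formula with made-up data whose ranks are distinct:

```python
xs = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
ys = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]

def ranks(values):
    # Rank 1 for the smallest value; assumes all values are distinct.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

rx, ry = ranks(xs), ranks(ys)
n = len(xs)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(rho)  # ~ -0.18: weak negative rank correlation
```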
Proofs
These proofs are included primarily to help readers understand and remember a few key relationships.
Chebyshev's Inequality
\(P(\mid X - \mu \mid \gt \epsilon) \leq (\frac{\sigma}{\epsilon})^2\)
\( \begin{align} \sigma^2 & = & \sum_x (x - \mu)^2 P(X) \\ & \geq & \sum_{x : \mid x - \mu \mid \gt \epsilon} (x - \mu)^2 P(X) \\ & \geq & \sum_{x : \mid x - \mu \mid \gt \epsilon} \epsilon^2 P(X) \\ & = & \epsilon^2 \sum_{x : \mid x - \mu \mid \gt \epsilon} P(X) \\ & = & \epsilon^2 P(\mid X - \mu \mid \gt \epsilon) \end{align} \)
Poisson as the limit to the binomial distribution
- For binomial:
\( \begin{align} P(X = k) & = & \binom{n}{k} p^k (1 - p)^{n - k} \end{align} \)
- Let \(\lambda = np\) be the success rate (number of trials times probability per trial)
- Then \(p = \frac{\lambda}{n}\) and:
\( \begin{align} P(X = k) & = & \frac{n!}{k!(n - k)!} (\frac{\lambda}{n})^k (1 - \frac{\lambda}{n})^{n - k} \\ & = & (\frac{\lambda^k}{k!}) \frac{n!}{(n - k)!} \frac{1}{n^k} (1 - \frac{\lambda}{n})^n (1 - \frac{\lambda}{n})^{- k} \end{align} \)
- But:
\( \begin{align} \lim_{n \rightarrow \infty} \frac{n!}{(n - k)!} \frac{1}{n^k} & = & \lim_{n \rightarrow \infty} \frac{n(n - 1)(n - 2)\ldots(n - k + 1)}{n^k} \\ & = & \lim_{n \rightarrow \infty} \frac{n}{n} \frac{n - 1}{n} \ldots \frac{n - k + 1}{n} \\ & = & 1 \end{align} \)
- And since \(e = \lim_{x \rightarrow \infty} (1 + \frac{1}{x})^x\), if \(\theta = -\frac{n}{\lambda}\) then:
\( \begin{align} \lim_{n \rightarrow \infty} (1 - \frac{\lambda}{n})^n & = & \lim_{n \rightarrow \infty} (1 + \frac{1}{\theta})^{-\lambda\theta} \\ & = & e^{- \lambda} \end{align} \)
- Finally:
\( \begin{align} \lim_{n \rightarrow \infty} (1 - \frac{\lambda}{n})^{-k} & = & 1 \end{align} \)
- Multiplying these all together gives the formula for the Poisson distribution
Relationship between Poisson and exponential distributions
- Let \(N_t\) be the number of events in time \(t\)
- And \(X_t\) be the time for one additional event if there was an event at time \(t\)
- If \(X_t \gt x\) then \(N_t = N_{t+x}\)
- Since total probability is always 1, \(P(X_t \leq x) = 1 - P(X_t \gt x)\)
- So \(P(X_t \leq x) = 1 - P(N_{t+x} - N_t = 0) = 1 - P(N_x = 0)\) (because the process is memoryless)
- But \(P(N_x = 0) = \frac{(\lambda x)^0}{0!}e^{- \lambda x} = e^{- \lambda x}\)
- So \(P(X_t \leq x) = 1 - e^{- \lambda x}\)
Bessel correction
- Within a sample \(a\), the values \(X_i\) deviate from the sample mean \(\bar{X}\) with variance \(\sigma_{a}^2\)
- But the sample mean \(\bar{X}\) deviates from the population mean \(\mu\) with variance \(\sigma^2 / n\)
- So the population variance \(\sigma^2 = \sigma_{a}^2 + \sigma^2 / n\)
- Rearranging gives \(\sigma^2 = \frac{n}{n-1} \sigma_{a}^2\)
- But \(\sigma_{a}^2 = \frac{\sum (X_i - \bar{X})^2}{n}\), so \(\sigma^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}\)
Student's t distribution
The PDF of Student's t-distribution with \(\nu\) degrees of freedom is:
\( \begin{align} f(t) & = & \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi} \Gamma(\frac{\nu}{2})} (1+\frac{t^2}{\nu})^{-\frac{\nu+1}{2}} \end{align} \)
where \(\Gamma\) is the gamma function. For positive even integer values of \(\nu\), the first term is:
\( \begin{align} \frac{\Gamma(\frac{\nu+1}{2})} {\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})} & = & \frac{(\nu -1)(\nu -3)\cdots 5 \cdot 3}{2 \sqrt{\nu}(\nu -2)(\nu -4)\cdots 4 \cdot 2} \end{align} \)
For positive odd integer values of \(\nu\), the first term is:
\( \begin{align} \frac{\Gamma(\frac{\nu+1}{2})} {\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})} & = & \frac{(\nu -1)(\nu -3)\cdots 4 \cdot 2}{\pi \sqrt{\nu}(\nu -2)(\nu -4)\cdots 5 \cdot 3} \end{align} \)