Theory
A little theory can go a long way.
Basic probability
- Assume a set of events \(E\) and a probability function \(P\) with \(0 \leq P(x) \leq 1\)
- \(P(E) = 1\) (i.e., something must happen)
- If \(A\) and \(B\) are disjoint sets of events from \(E\), then \(P(A \cup B) = P(A) + P(B)\) (i.e., the probability of the union of disjoint sets is the sum of their probabilities)
- A few consequences:
- If \(A\) and \(B\) are not disjoint, then \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
- \(P(\neg A) = 1 - P(A)\)
- If \(A\) and \(B\) are independent, \(P(A \cap B) = P(A)P(B)\)
- The probability of the union of \(n\) equally likely disjoint events \(x_i\) is \(nP(x_i)\)
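These rules are easy to check empirically. Here is a minimal Python sketch (the die and the two events are made-up examples) verifying the rule for overlapping events:

```python
import random

# Monte Carlo check of P(A or B) = P(A) + P(B) - P(A and B)
# for one roll of a fair six-sided die.
TRIALS = 100_000
a = b = both = union = 0
for _ in range(TRIALS):
    roll = random.randint(1, 6)
    in_a = roll <= 4           # A: roll is 1-4, P(A) = 4/6
    in_b = roll % 2 == 0       # B: roll is even, P(B) = 3/6
    a += in_a
    b += in_b
    both += in_a and in_b
    union += in_a or in_b
print(union / TRIALS)             # ~5/6
print((a + b - both) / TRIALS)    # same value
```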
Combinatorics
- There are \(n^{k}\) ways to sample \(k\) distinguishable items from \(n\) with replacement
- There are \(P(n, k) = \frac{n!}{(n - k)!}\) ways to sample \(k\) distinguishable items from \(n\) without replacement
- If the order of selection doesn't matter (i.e., the samples are indistinguishable), the number of selections is
\(C(n, k) = \binom{n}{k} = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}\)
- Divide by \(k!\) because each selection of \(k\) items can be arranged that many ways
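Python's standard library implements all three counts directly; a small sketch with assumed values \(n = 10\), \(k = 3\):

```python
import math

n, k = 10, 3
print(n ** k)           # with replacement: 1000
print(math.perm(n, k))  # ordered, without replacement: n!/(n-k)! = 720
print(math.comb(n, k))  # unordered, without replacement: 120
print(math.perm(n, k) // math.factorial(k))  # divide by k!: also 120
```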
Conditional probability
- The conditional probability \(P(A \mid B)\) of \(A\) given \(B\) is the probability of \(A\) occurring given that \(B\) is known to occur
- \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)
Bayes Rule
- Since \(P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)\), we have Bayes' Rule \(P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}\)
- Special case: since \(P(A) = P(A \mid B)P(B) + P(A \mid \neg B)P(\neg B)\), \(P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A \mid B)P(B) + P(A \mid \neg B)P(\neg B)}\)
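A worked example of the special case with made-up numbers: a defect detector that is 99% sensitive and 95% specific, applied where 2% of items are actually defective:

```python
# B: item is defective; A: detector reports a defect.
p_a_given_b = 0.99        # P(A | B), hypothetical sensitivity
p_a_given_not_b = 0.05    # P(A | not B), hypothetical false-positive rate
p_b = 0.02                # P(B), hypothetical defect rate

# Denominator is the total-probability expansion of P(A) shown above.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # ~0.29: most reported defects are false alarms
```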
Mean and variance
- The mean \(\mu\) or expected value \(E(X)\) is the weighted sum of possible outcomes \(\sum_{x} xP(x)\)
- The definition implies linearity: \(E(aX + bY + c) = aE(X) + bE(Y) + c\)
- The variance \(\sigma^2\) is the expected value of the square of the difference between values and the mean
\(\sum_{x} (x - \mu)^2P(x)\)
- Squaring guarantees that values are positive
- And gives extra weight to outliers
- But the units are weird: "bugs squared"
- \(\sigma^{2}_{A + B} = \sigma^2_A + \sigma^2_B\) if \(A\) and \(B\) are independent
- The standard deviation \(\sigma\) is the square root of the variance
- Same units as original variable
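Applying the definitions to one roll of a fair die:

```python
# Mean, variance, and standard deviation from the definitions above.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6
mu = sum(x * p for x in outcomes)               # 3.5
var = sum((x - mu) ** 2 * p for x in outcomes)  # 35/12 ~ 2.917
sigma = var ** 0.5                              # ~1.708, same units as the roll
print(mu, var, sigma)
```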
Covariance and correlation
- Covariance \(\sigma_{XY}\) of \(X\) and \(Y\) is \(E((X - \mu_{X})(Y - \mu_{Y}))\)
- If \(X\) and \(Y\) are both above or below their means at the same time, \(\sigma_{XY}\) will be positive
- If \(X\) is above when \(Y\) is below and vice versa, \(\sigma_{XY}\) will be negative
- If there is no relation, \(\sigma_{XY}\) will be zero
- Pearson's correlation coefficient \(r_{XY}\) is covariance normalized by standard deviations
- \(r_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\)
- Always lies in \([-1 \ldots 1]\)
- Chebyshev's Inequality: \(P(\mid X - \mu \mid \gt \epsilon) \leq (\frac{\sigma}{\epsilon})^2\)
- I.e., the probability of a value being more than \(\epsilon\) away from the mean is bounded by the square of the ratio between the standard deviation and \(\epsilon\)
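A sketch of Pearson's \(r\) computed directly from the definitions (the data is made up for illustration; `statistics.correlation` needs Python 3.10 or later):

```python
import statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
mx, my = statistics.mean(xs), statistics.mean(ys)
# Sample covariance, then normalize by the standard deviations.
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
r = cov / (statistics.stdev(xs) * statistics.stdev(ys))
print(r)                               # ~0.999: strong positive relation
print(statistics.correlation(xs, ys))  # library version, same value
```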
Bernoulli distribution
- A Bernoulli distribution describes a random variable with just two values, 1 and 0 (sometimes called success and failure)
- Named after Jacob Bernoulli, the mathematician who first described it
- Probability of success is \(p\)
- \(\mu = 0(1 - p) + 1(p) = p\)
- \(\sigma^2 = \sum (x - p)^2 P(x) = (0 - p)^2 (1 - p) + (1 - p)^2 p = p(1 - p)\)
Binomial distribution
- A binomial distribution describes the number of successes in \(n\) trials of a Bernoulli variable with probability \(p\)
- Name means "two numbers" (referring to \(n\) and \(x\))
- Probability of exactly \(x\) successes in \(n\) trials is \(\binom{n}{x} p^x (1-p)^{n-x}\)
- Number of different arrangements of that many successes and that many failures
- \(\mu = np\) (by linearity of expectation)
- \(\sigma^2 = np(1-p)\) (because \(\sigma^{2}_{A + B} = \sigma^2_A + \sigma^2_B\) if \(A\) and \(B\) are independent)
- Probability of up to \(x\) successes is tedious to calculate directly, but the normal approximation described under the Central Limit Theorem works well for large \(n\)
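A simulation sketch checking the binomial mean and variance formulas with assumed values \(n = 20\), \(p = 0.3\):

```python
import random
import statistics

n, p = 20, 0.3
# Each sample is the number of successes in n Bernoulli trials.
samples = [sum(random.random() < p for _ in range(n))
           for _ in range(50_000)]
print(statistics.mean(samples), n * p)                # both ~6.0
print(statistics.variance(samples), n * p * (1 - p))  # both ~4.2
```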
Geometric distribution
- A geometric distribution describes the number of Bernoulli trials needed to get the first success
- Potentially unbounded: there is no limit on how many trials the first success may take
- \(P(x) = (1 - p)^{x - 1}p\) (i.e., \(x-1\) failures followed by 1 success)
- To find the mean, use the fact that \(\sum_{i=0}^{\infty} x^i = \frac{1}{1 - x}\) for \(0 \lt x \lt 1\)
- A geometric series, which gives the distribution its name
- \(\mu = 1/p\)
- The variance takes a little more work, but is \(\sigma^2 = \frac{1-p}{p^2}\)
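The same kind of simulation check for the geometric distribution, with an assumed \(p = 0.25\):

```python
import random
import statistics

p = 0.25

def trials_to_first_success():
    count = 1
    while random.random() >= p:  # failure: keep trying
        count += 1
    return count

samples = [trials_to_first_success() for _ in range(50_000)]
print(statistics.mean(samples), 1 / p)                 # both ~4.0
print(statistics.variance(samples), (1 - p) / p ** 2)  # both ~12.0
```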
Negative binomial distribution
- The binomial distribution describes the number of successes in a fixed number of trials
- The negative binomial distribution describes the number of trials required to achieve a certain number of successes
- "Negative" in the sense of opposite: there is nothing negative in the values
- \(P(x) = \binom{x - 1}{k - 1} (1 - p)^{x - k} p^k\)
- Reading right to left, this is \(k\) successes, \(x-k\) failures, and the number of possible arrangements of them, where the \(-1\)s reflect the fact that the last trial must be a success
- \(\mu = \frac{k}{p}\)
- \(\sigma^2 = \frac{k(1-p)}{p^2}\)
- With \(k=1\), the negative binomial is just the geometric distribution (number of trials to first success)
Poisson distribution
- Number of events occurring within a fixed period has a Poisson distribution
- Assuming events never occur simultaneously
- \(P(x) = e^{- \lambda}\frac{\lambda^x}{x!}\)
- \(\mu = \lambda\)
- \(\sigma^2 = \lambda\)
- Poisson is a limiting case of the binomial in which the number of trials is very large and the probability of success in any trial is small (see the proof at the end of this section)
- For \(n \geq 30\) and \(p \leq 0.05\), Poisson is a good approximation of binomial
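A sketch comparing the two PMFs for an assumed \(n = 100\), \(p = 0.02\) (so \(\lambda = np = 2\)):

```python
import math

n, p = 100, 0.02
lam = n * p
for k in range(6):
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    print(k, round(binom, 4), round(poisson, 4))  # agree to ~2 digits
```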
Probability density and cumulative distribution
- Function describing probabilities of discrete events is called the probability mass function
- When describing continuous events, use:
- Cumulative distribution function \(F(x) = P(X \leq x)\)
- Probability density function \(f(x) = dF/dx\)
- So \(P(a \lt X \lt b) = \int_{a}^{b} f(x) dx\)
- Require \(\int_{-\infty}^{\infty} f(x) dx = 1\)
- And notice \(P(x) = P(x \leq X \leq x) = \int_{x}^{x} f(x) dx = 0\)
- I.e., probability of any specific exact value is 0
- \(\mu = \int_{-\infty}^{\infty} x f(x) dx\)
- \(\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx = \int_{-\infty}^{\infty} x^2 f(x) dx - \mu^2\)
Uniform distribution
- Uniform distribution has equal probability over a finite range \([a \ldots b]\)
- \(f(x) = \frac{1}{b - a}\)
- \(P(t \leq X \leq t+h) = \frac{h}{b - a}\) for \(a \leq t \leq t+h \leq b\)
- I.e., probability is proportional to fraction of range
- Standard uniform distribution has range \([0 \ldots 1]\)
- \(\mu = \frac{1}{2}\)
- \(\sigma^2 = \int_{0}^1 x^2 dx - (\frac{1}{2})^2 = \frac{1}{12}\)
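A crude midpoint-rule integration confirming the standard uniform results from the definitions in the previous section:

```python
STEPS = 100_000
dx = 1.0 / STEPS
xs = [(i + 0.5) * dx for i in range(STEPS)]  # f(x) = 1 on [0, 1]
mu = sum(x * dx for x in xs)
var = sum((x - mu) ** 2 * dx for x in xs)
print(mu, var)  # ~0.5 and ~0.0833 (= 1/12)
```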
Exponential distribution
- Assume the waiting time for the next rare event is independent of the time waited so far
- I.e., \(P(X \geq b+a \mid X \geq a) = P(X \geq b)\)
- Can show that \(f(x) = \lambda e^{- \lambda x}\) is the unique solution, so \(F(x) = 1 - e^{- \lambda x}\)
- \(\mu = \frac{1}{\lambda}\)
- \(\sigma^2 = \frac{1}{\lambda^2}\)
- The parameter \(\lambda\) is the frequency
- If \(\lambda = 2\) then we expect 2 events per unit time and \(\mu = \frac{1}{\lambda} = 0.5\) units of time between events
- The number of events in time \(t\) has a Poisson distribution with parameter \(\lambda t\) (discrete)
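A simulation sketch of memorylessness, with assumed values \(\lambda = 2\), \(a = 0.5\), \(b = 0.3\):

```python
import random

lam, a, b = 2.0, 0.5, 0.3
samples = [random.expovariate(lam) for _ in range(200_000)]
# P(X >= a + b | X >= a) should equal P(X >= b).
survived_a = [x for x in samples if x >= a]
lhs = sum(x >= a + b for x in survived_a) / len(survived_a)
rhs = sum(x >= b for x in samples) / len(samples)
print(lhs, rhs)  # both ~exp(-lam * b) ~ 0.549
```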
Gamma distribution
- If there are \(\alpha\) independent sequential steps,
each taking an exponentially-distributed time with rate \(\lambda\),
then the total time has a Gamma distribution Gamma(\(\alpha\), \(\lambda\))
- Name comes from the gamma function, which extends factorial to complex numbers
- \(f(x) = \frac{\lambda^\alpha}{\Gamma (\alpha)} x^{\alpha - 1} e^{- \lambda x}\)
- \(\alpha\) is sometimes called the shape parameter
- Gamma(1, \(\lambda\)) is an exponential distribution with rate \(\lambda\)
- \(\mu = \frac{\alpha}{\lambda}\)
- \(\sigma^2 = \frac{\alpha}{\lambda^2}\)
- For specific values, use library or lookup tables
- For integer \(\alpha\), Gamma(\(\alpha\), \(\lambda\)) is the waiting time until the \(\alpha^{th}\) event in a Poisson process with rate \(\lambda\)
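A simulation sketch building Gamma(\(\alpha\), \(\lambda\)) from exponential steps, with assumed \(\alpha = 3\), \(\lambda = 2\):

```python
import random
import statistics

alpha, lam = 3, 2.0
# Each sample is the total time for alpha sequential exponential steps.
samples = [sum(random.expovariate(lam) for _ in range(alpha))
           for _ in range(100_000)]
print(statistics.mean(samples), alpha / lam)           # both ~1.5
print(statistics.variance(samples), alpha / lam ** 2)  # both ~0.75
```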
Normal distribution
- In its full glory, normal distribution
has \(f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}}\)
- There is no closed form for the integral \(F(x)\)
- But as the notation suggests, the mean is \(\mu\) and the variance is \(\sigma^2\)
- The standard normal distribution \(Z\) has mean \(\mu = 0\) and standard deviation \(\sigma = 1\)
- Reconstruct an arbitrary normal distribution with \(X = \mu + \sigma Z\)
Central Limit Theorem
- Let \(S_n = X_1 + X_2 + \ldots + X_n\) be the sum of \(n\) independent random variables, all with mean \(\mu\) and standard deviation \(\sigma\)
- As \(n \rightarrow \infty\), \(\frac{S_n - n\mu}{\sigma \sqrt{n}}\) converges on a standard normal random variable
- Rate of convergence is \(\frac{1}{\sqrt{n}}\)
- Heuristic: for \(n \gt 30\), a normal distribution is an accurate approximation to \(S_n\)
- In particular, binomial(\(n\), \(p\)) \(\approx\) normal(\(\mu = np\), \(\sigma = \sqrt{n p (1 - p)}\) )
- But use a continuity correction: adjust interval by 0.5 units to allow for difference between discrete and continuous
- E.g., \(P(X = x) = P(x - 0.5 \lt X \lt x + 0.5)\) (because the probability of a point in a continuous distribution is zero)
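A sketch of the normal approximation with the continuity correction, using assumed values \(n = 50\), \(p = 0.4\), \(x = 20\):

```python
import math
from statistics import NormalDist

n, p, x = 50, 0.4, 20
exact = math.comb(n, x) * p ** x * (1 - p) ** (n - x)
# Approximate P(X = x) as P(x - 0.5 < N < x + 0.5).
approx_dist = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
approx = approx_dist.cdf(x + 0.5) - approx_dist.cdf(x - 0.5)
print(exact, approx)  # both ~0.115
```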
Sampling
- A population has a parameter; a sample has a statistic
- Sample mean \(\bar{X}\) estimates the population mean \(\mu = E(X)\)
- Variance of \(\bar{X}\) is \(\frac{\sigma^2}{n}\)
- Distribution of sample means is normal, i.e. \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) is standard normal as \(n \rightarrow \infty\)
- Regardless of the underlying distribution of \(X\)
- The median is the central value such that \(P(X \lt M) \leq 0.5\) and \(P(X \gt M) \leq 0.5\)
- If the distribution is skewed to the right, \(M \lt \mu\)
- If the distribution is skewed to the left, \(M \gt \mu\)
- A quartile is a value that divides the sample into quarters
- E.g., first quartile \(Q_1\) splits values 25%/75%, third quartile \(Q_3\) splits values 75%/25%
- Second quartile is the same as the median
- The interquartile range IQR is \(Q_3 - Q_1\)
- Anything more than 1.5 IQR below \(Q_1\) or above \(Q_3\) is considered an outlier
- Because if the data is normal, only 0.7% of it should lie outside these bounds
- The sample variance is:
\( \begin{align} s^2 & = & \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ & = & \frac{\sum X_i^2 - n\bar{X}^2}{n - 1} \end{align} \)
- Using \(n-1\) instead of \(n\) ensures that \(s^2\) is unbiased (the Bessel correction)
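A simulation sketch of why \(n-1\) matters: with many small samples from a standard uniform population (true variance \(1/12\)), dividing by \(n\) systematically underestimates:

```python
import random
import statistics

POPULATION_VAR = 1 / 12  # standard uniform, from the section above
n, runs = 5, 50_000
biased = unbiased = 0.0
for _ in range(runs):
    sample = [random.random() for _ in range(n)]
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased += ss / n          # divides by n: too small on average
    unbiased += ss / (n - 1)  # Bessel-corrected
print(biased / runs, unbiased / runs, POPULATION_VAR)
# ~0.0667 vs ~0.0833 vs 0.0833
```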
Parameter estimation
- The \(k^{th}\) population moment is \(E(X^k)\)
- The \(k^{th}\) sample moment is \(\frac{1}{n}\sum_{i=1}^{n} X_i^k\)
- The first sample moment is the sample mean
- The population and sample central moments are \(E(X - \mu)^k\) and \(\frac{1}{n}\sum(X_i - \bar{X})^k\) respectively
- I.e., the moments after shifting the data to a mean of zero
- Parameter estimation via the method of moments: find parameter values to match sample moments
- Parameter estimation via maximum likelihood: choose parameter values to maximize the likelihood of the observed sample
- Example: estimating \(\lambda\) for a Poisson distribution
- PMF is \(P(x) = e^{- \lambda} \frac{\lambda^x}{x!}\)
- Taking logarithms, \(\ln{P(x)} = -\lambda + x \ln{\lambda} - \ln{x!}\)
- So we need to maximize the log-likelihood of the whole sample, \(\ln{L(\lambda)} = \sum{(-\lambda + X_i \ln{\lambda})} + C = -n \lambda + \ln{\lambda}\sum{X_i} + C\), where \(C = -\sum{\ln{X_i!}}\) is a constant that does not depend on \(\lambda\)
- Differentiating with respect to \(\lambda\) and setting equal to zero gives \(\frac{\partial}{\partial{\lambda}}\ln{L(\lambda)} = -n + \frac{1}{\lambda}\sum{X_i} = 0\)
- Only solution is \(\lambda = \frac{1}{n}\sum{X_i} = \bar{X}\)
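A numerical check that the sample mean recovers \(\lambda\); Poisson samples are generated here by counting exponential inter-arrival times within one unit of time (the relationship proved at the end of this section):

```python
import random
import statistics

lam = 4.0

def poisson_sample():
    # Count events whose cumulative exponential waiting times fit in [0, 1).
    count, t = 0, random.expovariate(lam)
    while t < 1.0:
        count += 1
        t += random.expovariate(lam)
    return count

data = [poisson_sample() for _ in range(20_000)]
print(statistics.mean(data))  # the MLE of lambda, ~4.0
```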
Student's t-distribution
- Student's t-distribution is used to estimate the mean of a normally distributed population
when the sample size is small (e.g., fewer than 30) and the variance is unknown
- The name comes from "Student", the pseudonym used by the mathematician (William Sealy Gosset) who first used it this way
- If \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), then \(\bar{X} = \frac{1}{n}\sum{X_i}\) is the sample mean and \(s^2 = \frac{1}{n-1}\sum{(X_i - \bar{X})^2}\) is the Bessel-corrected sample variance
- The variable \(\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}\) has a standard normal distribution
- However, the variable \(\frac{\bar{X} - \mu}{s / \sqrt{n}}\) has a t-distribution
with \(n-1\) degrees of freedom
- \(n-1\) because the deviations \(X_i - \bar{X}\) must sum to zero
- Once \(n-1\) of them are known, the value of the \(n^{th}\) is fixed
- The exact formula for the t-distribution is a little bit scary
- The PDF's shape resembles that of a normal distribution with mean 0 and variance 1, but is slightly lower and wider.
- The two become closer as the degrees of freedom \(\nu\) gets larger.
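A sketch comparing the two PDFs at a few points, using the exact t formula given in the proofs section:

```python
import math
from statistics import NormalDist

def t_pdf(t, nu):
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi)
                                        * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)

z = NormalDist()
for x in (0.0, 1.0, 2.0):
    # nu = 5 is visibly lower and wider than normal; nu = 30 is close.
    print(x, t_pdf(x, 5), t_pdf(x, 30), z.pdf(x))
```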
Confidence intervals
- A confidence interval is an interval \([a \ldots b]\)
that has some probability \(p\) of containing the actual value of a statistic
- E.g., "There is a 90% probability that the actual mean of this population lies between 2.5 and 3.5"
- Larger intervals have a higher probability but are less precise
- If there are more than 30 samples or the standard deviation \(\sigma\) is known, use a z-test:
- Choose a confidence level \(C\) (typically 95%)
- Find the value \(z^{\star}\) such that \(P(Z \gt z^{\star}) = \frac{1 - C}{2}\)
in a standard normal distribution
- Divide by 2 because the normal curve has two symmetric tails
- Calculate the sample mean \(\bar{X}\)
- Interval is \(\bar{X} \pm z^{\star}\frac{\sigma}{\sqrt{n}}\)
- If there are fewer than 30 samples or the standard deviation isn't known, use a t-test:
- Choose a confidence level \(C\)
- Find a value \(t^{\star}\) such that \(P(T \gt t^{\star}) = \frac{1 - C}{2}\) in a Student's t-distribution with \(n-1\) degrees of freedom
- Estimate the standard deviation \(s\)
- Interval is \(\bar{X} \pm t^{\star}\frac{s}{\sqrt{n}}\)
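A z-interval sketch with made-up data, pretending \(\sigma = 2\) is known; the t-interval is the same calculation with \(t^{\star}\) (available from, e.g., `scipy.stats.t.ppf`) and \(s\) in place of \(z^{\star}\) and \(\sigma\):

```python
import statistics
from statistics import NormalDist

data = [4.8, 5.2, 3.9, 6.1, 5.5, 4.4, 5.0, 5.8, 4.6, 5.3]
sigma, C = 2.0, 0.95
z_star = NormalDist().inv_cdf((1 + C) / 2)  # ~1.96 for 95%
mean = statistics.mean(data)
margin = z_star * sigma / len(data) ** 0.5
print(mean - margin, mean + margin)
```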
Hypothesis testing
- What is the probability of seeing this difference between two datasets?
- The null hypothesis \(H_0\) is that the samples come from a single population and the observed difference is purely due to chance
- The alternative hypothesis \(H_A\) is that the samples come from two different populations
- False positive: decide that the difference is not purely random when it is
- False negative: decide the difference is purely random when it isn't
- Example: if a coin comes up heads 9 times out of 10, how likely is that result if the coin is actually fair?
- Probability if the coin is fair is \(\binom{10}{9} \cdot 0.5^9 \cdot (1-0.5)^1 = 10 \cdot 0.00195 \cdot 0.5 = 0.00976\)
- I.e., less than 1% chance of seeing this result if the coin is fair
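The same calculation in code:

```python
import math

# Probability of exactly 9 heads in 10 flips of a fair coin.
print(math.comb(10, 9) * 0.5 ** 9 * 0.5 ** 1)  # ~0.0098
# Probability of a result at least this surprising (9 or 10 heads):
print(sum(math.comb(10, k) * 0.5 ** 10 for k in (9, 10)))  # ~0.0107
```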
Prediction
- Accuracy is the fraction of correct predictions, i.e., (true positives + true negatives) divided by all predictions
- Not useful if there are only a few defective items, since "all good" will have high accuracy
- Precision is the fraction of predicted positives that are actually positive, i.e., true positives / (true positives + false positives)
- Recall is the fraction of actual positives the method identifies, i.e., true positives / (true positives + false negatives)
- Perfect precision and perfect recall mean all items are classified correctly
- But increasing precision often reduces recall and vice versa
- The F-measure is the harmonic mean of precision and recall
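The measures computed from a hypothetical confusion matrix in which defective items are rare:

```python
tp, fp, fn, tn = 40, 10, 20, 930  # made-up counts
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.97, despite 20 missed defects
precision = tp / (tp + fp)                  # 0.80
recall = tp / (tp + fn)                     # ~0.67
f_measure = 2 * precision * recall / (precision + recall)  # ~0.73
print(accuracy, precision, recall, f_measure)
```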
Spearman's rank correlation
- Spearman's rank correlation measures the rank correlation between two variables
- Rather than the values, measure how well the sorted order of items matches
- Given paired observations \(X_i\) and \(Y_i\), sort each variable, assign ranks \(r_{X_i}\) and \(r_{Y_i}\), and calculate the correlation coefficient of the ranks
- If ranks are distinct integers, can use the simplified formula \(\rho = 1 - \frac{6 \sum{d_i^2}}{n (n^2 - 1)}\) where \(d_i = r_{X_i} - r_{Y_i}\) is the difference between the ranks of observation \(i\)
- Standard error is \(\sigma = \frac{0.6325}{\sqrt{n - 1}}\)
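A sketch of the simplified formula with made-up data whose ranks are distinct:

```python
xs = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
ys = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]

def ranks(values):
    # Rank 1 for the smallest value; assumes all values are distinct.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

rx, ry = ranks(xs), ranks(ys)
n = len(xs)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(rho)  # ~ -0.18: weak negative rank correlation
```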
Proofs
These proofs are included primarily to help readers understand and remember a few key relationships.
Chebyshev's Inequality
\(P(\mid X - \mu \mid \gt \epsilon) \leq (\frac{\sigma}{\epsilon})^2\)
\( \begin{align} \sigma^2 & = & \sum_x (x - \mu)^2 P(X) \\ & \geq & \sum_{x : \mid x - \mu \mid \gt \epsilon} (x - \mu)^2 P(X) \\ & \geq & \sum_{x : \mid x - \mu \mid \gt \epsilon} \epsilon^2 P(X) \\ & = & \epsilon^2 \sum_{x : \mid x - \mu \mid \gt \epsilon} P(X) \\ & = & \epsilon^2 P(\mid X - \mu \mid \gt \epsilon) \end{align} \)
Poisson as the limit to the binomial distribution
- For binomial:
\( \begin{align} P(X = k) & = & \binom{n}{k} p^k (1 - p)^{n - k} \end{align} \)
- Let \(\lambda = np\) be the success rate (number of trials times probability per trial)
- Then \(p = \frac{\lambda}{n}\) and:
\( \begin{align} P(X = k) & = & \frac{n!}{k!(n - k)!} (\frac{\lambda}{n})^k (1 - \frac{\lambda}{n})^{n - k} \\ & = & (\frac{\lambda^k}{k!}) \frac{n!}{(n - k)!} \frac{1}{n^k} (1 - \frac{\lambda}{n})^n (1 - \frac{\lambda}{n})^{- k} \end{align} \)
- But:
\( \begin{align} \lim_{n \rightarrow \infty} \frac{n!}{(n - k)!} \frac{1}{n^k} & = & \lim_{n \rightarrow \infty} \frac{n(n - 1)(n - 2)\ldots(n - k + 1)}{n^k} \\ & = & \lim_{n \rightarrow \infty} \frac{n}{n} \frac{n - 1}{n} \ldots \frac{n - k + 1}{n} \\ & = & 1 \end{align} \)
- And since \(e = \lim_{x \rightarrow \infty} (1 + \frac{1}{x})^x\), if \(\theta = -\frac{n}{\lambda}\) then:
\( \begin{align} \lim_{n \rightarrow \infty} (1 - \frac{\lambda}{n})^n & = & \lim_{n \rightarrow \infty} (1 + \frac{1}{\theta})^{-\lambda\theta} \\ & = & e^{- \lambda} \end{align} \)
- Finally:
\( \begin{align} \lim_{n \rightarrow \infty} (1 - \frac{\lambda}{n})^{-k} & = & 1 \end{align} \)
- Multiplying these all together gives the formula for the Poisson distribution
Relationship between Poisson and exponential distributions
- Let \(N_t\) be the number of events in time \(t\)
- And \(X_t\) be the time for one additional event if there was an event at time \(t\)
- If \(X_t \gt x\) then \(N_t = N_{t+x}\)
- Since total probability is always 1, \(P(X_t \leq x) = 1 - P(X_t \gt x)\)
- So \(P(X_t \leq x) = 1 - P(N_{t+x} - N_t = 0) = 1 - P(N_x = 0)\) (because the process is memoryless)
- But \(P(N_x = 0) = \frac{(\lambda x)^0}{0!}e^{- \lambda x} = e^{- \lambda x}\)
- So \(P(X_t \leq x) = 1 - e^{- \lambda x}\)
Bessel correction
- Within a sample \(a\), the values \(X_i\) deviate from the sample mean \(\bar{X}\) with variance \(\sigma_{a}^2\)
- But the sample mean \(\bar{X}\) deviates from the population mean \(\mu\) with variance \(\sigma^2 / n\)
- So the population variance \(\sigma^2 = \sigma_{a}^2 + \sigma^2 / n\)
- Rearranging gives \(\sigma^2 = \frac{n}{n-1} \sigma_{a}^2\)
- But \(\sigma_{a}^2 = \frac{\sum (X_i - \bar{X})^2}{n}\), so \(\sigma^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}\)
Student's t distribution
The PDF of Student's t-distribution with \(\nu\) degrees of freedom is:
\( \begin{align} f(t) & = & \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi} \Gamma(\frac{\nu}{2})} (1+\frac{t^2}{\nu})^{-\frac{\nu+1}{2}} \end{align} \)
where \(\Gamma\) is the gamma function. For positive even integer values of \(\nu\), the first term is:
\( \begin{align} \frac{\Gamma(\frac{\nu+1}{2})} {\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})} & = & \frac{(\nu -1)(\nu -3)\cdots 5 \cdot 3}{2 \sqrt{\nu}(\nu -2)(\nu -4)\cdots 4 \cdot 2} \end{align} \)
For positive odd integer values of \(\nu\), the first term is:
\( \begin{align} \frac{\Gamma(\frac{\nu+1}{2})} {\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})} & = & \frac{(\nu -1)(\nu -3)\cdots 4 \cdot 2}{\pi \sqrt{\nu}(\nu -2)(\nu -4)\cdots 5 \cdot 3} \end{align} \)