Statistics

This page pairs well with Probability.

Table of Distributions

| Distribution | Mass/density function | \(S_X\) | \(\mathbb{E}(X)\) | \(\mathrm{Var}(X)\) | MGF \(\phi_X(s)\) |
|---|---|---|---|---|---|
| Bernoulli \(\mathrm{Bern}(\pi)\) | \(P(X=1)=\pi,\ P(X=0)=1-\pi\) | \(\{0,1\}\) | \(\pi\) | \(\pi(1-\pi)\) | \((1-\pi)+\pi e^{s}\) |
| Binomial \(\mathrm{Bin}(n,\pi)\) | \(p_X(x)=\binom{n}{x}\pi^{x}(1-\pi)^{n-x}\) | \(\{0,1,\dots,n\}\) | \(n\pi\) | \(n\pi(1-\pi)\) | \((1-\pi+\pi e^{s})^{n}\) |
| Geometric \(\mathrm{Geo}(\pi)\) | \(p_X(x)=\pi(1-\pi)^{x-1}\) | \(\{1,2,\dots\}\) | \(\pi^{-1}\) | \((1-\pi)\pi^{-2}\) | \(\frac{\pi}{e^{-s}-1+\pi}\) |
| Poisson \(\mathcal{P}(\lambda)\) | \(p_X(x)=e^{-\lambda}\lambda^{x}/x!\) | \(\{0,1,\dots\}\) | \(\lambda\) | \(\lambda\) | \(\exp\{\lambda(e^{s}-1)\}\) |
| Uniform \(U[\alpha,\beta]\) | \(f_X(x)=(\beta-\alpha)^{-1}\) | \([\alpha,\beta]\) | \(\frac{1}{2}(\alpha+\beta)\) | \(\frac{1}{12}(\beta-\alpha)^2\) | \(\frac{e^{\beta s}-e^{\alpha s}}{s(\beta-\alpha)}\) |
| Exponential \(\mathrm{Exp}(\lambda)\) | \(f_X(x)=\lambda e^{-\lambda x}\) | \([0,\infty)\) | \(\lambda^{-1}\) | \(\lambda^{-2}\) | \(\frac{\lambda}{\lambda-s}\) |
| Gaussian \(\mathcal{N}(\mu,\sigma^{2})\) | \(f_X(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^{2}}\right\}\) | \(\mathbb{R}\) | \(\mu\) | \(\sigma^{2}\) | \(e^{\mu s+\frac{1}{2}\sigma^{2}s^{2}}\) |
| Gamma \(\Gamma(\alpha,\lambda)\) | \(f_X(x)=\frac{1}{\Gamma(\alpha)}\lambda^{\alpha}x^{\alpha-1}e^{-\lambda x}\) | \([0,\infty)\) | \(\alpha\lambda^{-1}\) | \(\alpha\lambda^{-2}\) | \(\left(\frac{\lambda}{\lambda-s}\right)^{\alpha}\) |
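
A quick numerical spot-check of two table rows (a minimal sketch assuming scipy is available; note scipy parameterises the Gamma with shape a and scale \(=1/\lambda\)):

```python
# Sketch: confirm the tabulated mean/variance for Gamma(alpha, lambda) and Poisson(lambda)
# against scipy.stats.  scipy is not part of these notes; it is only used for checking.
from scipy import stats

alpha, lam = 3.0, 2.0
gamma = stats.gamma(a=alpha, scale=1 / lam)        # scale = 1 / lambda
assert abs(gamma.mean() - alpha / lam) < 1e-12     # E(X) = alpha / lambda
assert abs(gamma.var() - alpha / lam**2) < 1e-12   # Var(X) = alpha / lambda^2

pois = stats.poisson(mu=lam)
assert pois.mean() == lam and pois.var() == lam    # E(X) = Var(X) = lambda
```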

Statistical Inference

Definition (Random Sample & Model)

Let \(X=(X_1,\ldots,X_n)\) be i.i.d. from a parametric family \(\{F_\theta:\theta\in\Theta\subset\mathbb{R}^p\}\). The parameter \(\theta\) is unknown; inference uses the randomness of \(X\) to learn about \(\theta\).

Definition (Estimator / Estimate)

An estimator of \(\theta\) is any statistic \(T_n=T(X)\). Its (random) distribution is the sampling distribution; a realised value is an estimate.

Definition (Bias, Variance & Standard Error)

\(\mathrm{Bias}(T_n)=\mathbb{E}_\theta[T_n]-\theta.\) \(\mathrm{Var}(T_n)=\mathbb{E}_\theta[(T_n-\mathbb{E}_\theta T_n)^2]\). The standard error is \(\mathrm{se}(T_n)=\sqrt{\mathrm{Var}(T_n)}\).

Definition (Mean Squared Error (MSE))

\(\mathrm{MSE}(T_n)=\mathbb{E}_\theta[(T_n-\theta)^2]=\mathrm{Var}(T_n)+\mathrm{Bias}(T_n)^2.\)
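
A small Monte Carlo sketch of the decomposition (assumes numpy; the biased divide-by-\(n\) variance estimator is used only as an illustration):

```python
# Sketch: check MSE(T) = Var(T) + Bias(T)^2 empirically for the divide-by-n
# variance estimator under N(mu, sigma^2) sampling.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 2.0, 10, 200_000
X = rng.normal(mu, sigma, size=(reps, n))
T = X.var(axis=1)                       # ddof=0 divides by n, so E[T] = (n-1)/n * sigma^2

bias = T.mean() - sigma**2              # approximately -sigma^2 / n
mse_direct = ((T - sigma**2) ** 2).mean()
mse_decomp = T.var() + bias**2          # Var(T) + Bias(T)^2
print(bias, mse_direct, mse_decomp)     # the two MSE estimates agree up to Monte Carlo error
```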

Definition (Consistency)

\(T_n\) is consistent for \(\theta\) if \(T_n \xrightarrow{P} \theta\) as \(n\to\infty\). (LLN and Continuous Mapping are the typical engines behind this.)

Definition (Fisher Information)

For a regular model with pdf/pmf \(f_\theta\), the (per-observation) Fisher information is \[I(\theta)=\mathrm{Var}_\theta\!\left[\dfrac{\partial}{\partial\theta}\log f_\theta(X)\right] = -\mathbb{E}_\theta\!\left[\dfrac{\partial^2}{\partial\theta^2}\log f_\theta(X)\right].\]

Theorem (Cramér–Rao Lower Bound (CRLB))

Under regularity conditions, any unbiased estimator \(T_n\) of a scalar function \(g(\theta)\) satisfies \[ \mathrm{Var}_\theta(T_n)\ \ge\ \frac{\big(g'(\theta)\big)^2}{n\,I(\theta)}. \] Hence \(1/(nI(\theta))\) is the information-limited variance bound for unbiased estimation of \(\theta\) (i.e. \(g(\theta)=\theta\)).
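
For concreteness (a standard calculation, added here as an example): in the Bernoulli(\(\pi\)) model, \(\log f_\pi(x)=x\log\pi+(1-x)\log(1-\pi)\), so \[ I(\pi)=-\mathbb{E}_\pi\!\left[-\frac{X}{\pi^{2}}-\frac{1-X}{(1-\pi)^{2}}\right]=\frac{1}{\pi}+\frac{1}{1-\pi}=\frac{1}{\pi(1-\pi)}, \] and the CRLB for unbiased estimation of \(\pi\) is \(\pi(1-\pi)/n\), which is exactly \(\mathrm{Var}(\overline X)\): the sample mean attains the bound.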

Remark

As \(n\) grows, asymptotics (LLN/CLT) typically dominate finite-sample quirks; large-sample normal approximations are the workhorse of practical inference.

Estimation Methods

Definition (Method of Moments (MoM))

Suppose \(\theta=(\theta_1,\ldots,\theta_p)\) can be written as \[ \theta_k=\beta_k\big(\mathbb{E}(X),\mathbb{E}(X^2),\ldots,\mathbb{E}(X^K)\big),\quad k=1,\ldots,p, \] for continuous \(\beta_k\) and finite moments up to order \(2K\). Replace expectations by the sample moments \(\widehat M_k=\frac{1}{n}\sum_{i=1}^n X_i^{k}\) to get \(\widehat\theta_k=\beta_k(\widehat M_1,\ldots,\widehat M_K)\). Under these conditions the MoM estimator is consistent.

Theorem (MoM Consistency (sketch))

If \(\mathrm{Var}(X^k)<\infty\) for \(k\le K\), then \(\widehat M_k \xrightarrow{P}\mathbb{E}(X^k)\) by the LLN; continuity of \(\beta_k\) gives \(\widehat\theta_k\xrightarrow{P}\theta_k\) by the Continuous Mapping Theorem.

Example (Rayleigh)

For \(f_\theta(x)=2\theta x e^{-\theta x^2}\,1_{\{x\ge0\}}\), we have \(\mathbb{E}(X)=\frac{\sqrt{\pi}}{2\sqrt{\theta}}\), so the MoM estimator is \[ \widehat\theta_{\text{MoM}}=\frac{\pi}{4}\,\overline X^{\, -2}. \] Its asymptotic distribution follows by the CLT and Delta Method.
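
A minimal simulation sketch of this estimator (assumes numpy; inverse-CDF sampling uses \(F_\theta(x)=1-e^{-\theta x^{2}}\), which follows from the density above):

```python
# Sketch: draw from the Rayleigh-type model f_theta(x) = 2*theta*x*exp(-theta*x^2)
# and compute the method-of-moments estimator theta_hat = pi / (4 * Xbar^2).
import numpy as np

rng = np.random.default_rng(1)
theta, n = 2.5, 5_000
U = rng.uniform(size=n)
X = np.sqrt(-np.log(U) / theta)        # inverse CDF: F(x) = 1 - exp(-theta x^2)

theta_mom = np.pi / (4 * X.mean() ** 2)
print(theta_mom)                        # close to theta = 2.5 for large n
```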

Definition (Maximum Likelihood (MLE))

For i.i.d. \(X_1,\dots,X_n\), the likelihood is \(L_X(\theta)=\prod_{i=1}^n f_\theta(X_i)\) (discrete: product of pmf’s). The MLE is \(\widehat\theta=\arg\max_{\theta\in\Theta}L_X(\theta)\). Equivalently, maximise the log-likelihood \(\ell_X(\theta)=\sum_{i=1}^n\log f_\theta(X_i)\).

Remark (Score Equation)

Interior maxima solve the score equation: \(\partial \ell_X(\theta)/\partial\theta=0\).
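
A numerical sketch of maximum likelihood for the Rayleigh model above (assumes numpy/scipy; the closed form \(\widehat\theta=n/\sum_i X_i^{2}\) used as a check is derived here from the score equation, it is not stated in the notes):

```python
# Sketch: maximise the Rayleigh log-likelihood numerically and compare with the
# score-equation solution theta_hat = n / sum(X_i^2).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
theta, n = 2.5, 5_000
X = np.sqrt(-np.log(rng.uniform(size=n)) / theta)     # same sampler as the MoM sketch

def neg_loglik(t):
    # -log L(t) = -( n log t + sum log(2 X_i) - t sum X_i^2 )
    return -(n * np.log(t) + np.log(2 * X).sum() - t * (X**2).sum())

theta_closed = n / (X**2).sum()                       # solves n/t - sum X_i^2 = 0
theta_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded").x
print(theta_closed, theta_numeric)                    # agree; both close to theta = 2.5
```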

Theorem (Asymptotic Normality of the MLE)

Under standard regularity and \(0<I(\theta)<\infty\), \[ \sqrt{n\,I(\theta)}\big(\widehat\theta-\theta\big)\ \xrightarrow{d}\ \mathcal{N}(0,1), \] so \(\widehat\theta \approx \mathcal{N}\!\left(\theta,\,[nI(\theta)]^{-1}\right)\) for large \(n\). Consequently, MLEs are asymptotically unbiased and efficient (achieve the CRLB).

Example (Bernoulli(\(\pi\)))

\(\ell(\pi)=\big(\sum X_i\big)\log\pi+(n-\sum X_i)\log(1-\pi)\Rightarrow \widehat\pi=\frac{1}{n}\sum X_i.\)

Theorem (Normal Sample: \(\chi^2\) and \(t\) Facts)

Let \(X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}N(\mu,\sigma^2)\), \(\overline X=\frac1n\sum X_i\), and \(S^2=\frac1{n-1}\sum (X_i-\overline X)^2\). Then \[ \frac{\overline X-\mu}{S/\sqrt{n}}\ \sim\ t_{n-1},\qquad \frac{(n-1)S^2}{\sigma^2}\ \sim\ \chi^2_{\,n-1}. \]
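
A simulation sketch of the \(\chi^2\) fact (assumes numpy/scipy, used only for checking):

```python
# Sketch: compare empirical quantiles of (n-1) S^2 / sigma^2 with chi^2_{n-1} quantiles.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 8, 100_000
X = rng.normal(mu, sigma, size=(reps, n))
S2 = X.var(axis=1, ddof=1)                 # sample variance with divisor n - 1
Q = (n - 1) * S2 / sigma**2

probs = [0.05, 0.5, 0.95]
print(np.quantile(Q, probs))               # empirical quantiles
print(stats.chi2(df=n - 1).ppf(probs))     # chi^2_{n-1} quantiles: should be close
```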

Definition (Fisher Score & Information (sample))

Score: \(S_n(\theta)=\ell'_n(\theta)\). Fisher information: \(I_n(\theta)=-\mathbb E_\theta[\ell''_n(\theta)]\). Properties: \(\mathbb E_\theta[S_n(\theta)]=0\), \(\mathrm{Var}_\theta(S_n(\theta))=I_n(\theta)\).

Theorem (Asymptotic Variance of the MLE)

For the MLE \(\hat\theta_n\), \(I_n(\theta)\,\mathrm{Var}_\theta(\hat\theta_n)\ \xrightarrow{P}\ 1\). Equivalently, \(\mathrm{se}(\hat\theta_n)\approx I_n(\theta)^{-1/2}\).

Theorem (Delta Method)

If \(\sqrt{n}\,(\hat\theta-\theta)\Rightarrow N(0,\sigma^2)\) and \(g'(\theta)\neq0\), then \[ \sqrt{n}\,\frac{g(\hat\theta)-g(\theta)}{\sigma\,g'(\theta)}\ \Rightarrow\ N(0,1). \]
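
This completes the Rayleigh example above (a calculation added here; the notes only assert that the result follows). With \(m=\mathbb E(X)=\frac{\sqrt\pi}{2\sqrt\theta}\), \(\mathbb E(X^{2})=1/\theta\) so \(\mathrm{Var}(X)=\frac{4-\pi}{4\theta}\), and \(g(m)=\frac{\pi}{4m^{2}}\) so that \(g(\overline X)=\widehat\theta_{\text{MoM}}\): \[ g'(m)=-\frac{\pi}{2m^{3}}=-\frac{4\theta^{3/2}}{\sqrt\pi} \quad\Longrightarrow\quad \sqrt n\,\big(\widehat\theta_{\text{MoM}}-\theta\big)\ \Rightarrow\ \mathcal N\!\Big(0,\ g'(m)^{2}\,\mathrm{Var}(X)\Big)=\mathcal N\!\Big(0,\ \frac{4(4-\pi)\,\theta^{2}}{\pi}\Big). \]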

Theorem (Extended Delta Method)

If \(g'(\theta)=0\) but \(g^{(k)}(\theta)\neq0\) for the smallest \(k\ge2\), then \[ \frac{g(\hat\theta)-g(\theta)}{(\sigma/\sqrt{n})^{k}}\ \Rightarrow\ \frac{1}{k!}\,g^{(k)}(\theta)\,Z^{k},\quad Z\sim N(0,1). \]

Theorem (Multivariate Fisher Information & Delta)

For \(\boldsymbol\theta\in\mathbb R^{p}\), \(I_n(\boldsymbol\theta)=-\mathbb E[H_{\boldsymbol\theta}\ell_n]\) (Hessian). If \(\hat{\boldsymbol\theta}\) is the MLE and \(g:\mathbb R^{p}\to\mathbb R\) is differentiable, then \[ \frac{g(\hat{\boldsymbol\theta})-g(\boldsymbol\theta)}{\ \sqrt{\nabla g(\hat{\boldsymbol\theta})^{\!\top} I_n(\hat{\boldsymbol\theta})^{-1}\nabla g(\hat{\boldsymbol\theta})}\ }\ \Rightarrow\ N(0,1). \]

Confidence Intervals

Definition (Confidence Interval (CI))

A \(100(1-\alpha)\%\) confidence interval for \(\theta\) is a random interval \(C_\alpha(X)\) such that \(\mathbb{P}_\theta\!\big(\theta\in C_\alpha(X)\big)=1-\alpha\). The probability is over the sample \(X\); the parameter is fixed.

Definition (CI via Pivot / Asymptotics)

If a statistic \(T_n=T_n(X,\theta)\) (a pivot) has a known distribution that does not depend on \(\theta\), invert its quantiles to obtain a CI. More generally, if \(T_n\approx\mathcal{N}(0,1)\) for large \(n\), then a two-sided \(100(1-\alpha)\%\) CI is \[ \widehat\theta\ \pm\ z_{1-\alpha/2}\,\mathrm{se}(\widehat\theta), \] where \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\)-quantile of the standard normal.

Remark (Normal Mean (\(\sigma\) known))

If \(X_i\stackrel{\text{iid}}{\sim}\mathcal{N}(\mu,\sigma^2)\) with known \(\sigma\), then \[ \frac{\overline X-\mu}{\sigma/\sqrt{n}}\sim\mathcal{N}(0,1)\quad\Rightarrow\quad \mu\in\Big[\ \overline X\ \pm\ z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\ \Big]. \] (Quantiles \(z_{0.95}=1.645\), \(z_{0.975}=1.96\), \(z_{0.995}=2.575\).)
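
A minimal sketch of this interval (assumes numpy/scipy; the data are made up for illustration):

```python
# Sketch: 95% CI for a normal mean with sigma known.
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 4.9, 5.4, 5.0, 4.7])    # illustrative data (made up)
sigma = 0.3                                      # assumed known
z = stats.norm.ppf(0.975)                        # z_{1 - alpha/2} for alpha = 0.05 (about 1.96)
half = z * sigma / np.sqrt(len(x))
print(x.mean() - half, x.mean() + half)          # Xbar +/- z_{1-alpha/2} * sigma / sqrt(n)
```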

Remark (Wald CI from MLE)

From \(\widehat\theta\approx\mathcal{N}\!\big(\theta,[nI(\theta)]^{-1}\big)\), \[ \theta\in\Big[\ \widehat\theta\ \pm\ z_{1-\alpha/2}\ \sqrt{\tfrac{1}{nI(\widehat\theta)}}\ \Big] \] is an approximate \(100(1-\alpha)\%\) CI.
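
A sketch for the Bernoulli case (assumes numpy/scipy; uses \(I(\pi)=1/\{\pi(1-\pi)\}\) from the CRLB example above, with made-up 0/1 data):

```python
# Sketch: approximate 95% Wald CI for a Bernoulli pi, se = sqrt(1 / (n I(pi_hat))).
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])     # illustrative 0/1 data (made up)
n, pi_hat = len(x), x.mean()
se = np.sqrt(pi_hat * (1 - pi_hat) / n)           # 1 / sqrt(n I(pi_hat))
z = stats.norm.ppf(0.975)
print(pi_hat - z * se, pi_hat + z * se)
```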

Remark (Visual: central \(1-\alpha\) mass under \(N(0,1)\))

Remark (t-CI for a Normal Mean (\(\sigma\) unknown))

If \(X_i\sim N(\mu,\sigma^2)\) with unknown \(\sigma\), then a \(100(1-\alpha)\%\) CI for \(\mu\) is \[ \overline X\ \pm\ t_{n-1,\,1-\alpha/2}\,\frac{S}{\sqrt{n}}, \] where \(t_{n-1,\,1-\alpha/2}\) is the upper \((1-\alpha/2)\)-quantile of \(t_{n-1}\).
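
A minimal sketch (assumes numpy/scipy; same made-up data as above):

```python
# Sketch: 95% t-interval for a normal mean with sigma unknown.
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 4.9, 5.4, 5.0, 4.7])     # illustrative data (made up)
n = len(x)
tq = stats.t.ppf(0.975, df=n - 1)                 # t_{n-1, 1 - alpha/2} for alpha = 0.05
half = tq * x.std(ddof=1) / np.sqrt(n)            # S / sqrt(n); S uses divisor n - 1
print(x.mean() - half, x.mean() + half)
```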

Hypothesis Testing

Definition (Hypotheses & Test)

We formalise a claim about \(\theta\) as \[ H_0:\theta\in\Theta_0 \quad\text{vs}\quad H_1:\theta\in\Theta_1, \] choose a statistic \(S(X)\) and reject \(H_0\) when \(S(X)\) falls in a critical region \(C\).

Definition (Type I Error, Significance)

Type I error: rejecting \(H_0\) when \(H_0\) is true. Its probability \(\alpha=\mathbb{P}_\theta(S\in C)\) (for \(\theta\in\Theta_0\)) is the significance level.

Definition (Type II Error & Power)

Type II error: failing to reject \(H_0\) when \(H_1\) is true; probability \(\beta(\theta)\) (for \(\theta\in\Theta_1\)). The power function is \(\pi(\theta)=1-\beta(\theta)=\mathbb{P}_\theta(S\in C)\) for \(\theta\in\Theta_1\).

Definition (p-value)

For observed \(x\), the p-value is the smallest \(\alpha\) for which \(x\) lies in a level-\(\alpha\) rejection region (equivalently, the tail probability under \(H_0\) of outcomes as or more extreme than \(x\)).

Remark (Z-test for a Normal Mean (\(\sigma\) known))

With \(X_i\stackrel{\text{iid}}{\sim}\mathcal{N}(\mu,\sigma^2)\), test \(H_0:\mu=\mu_0\) using \[ Z=\frac{\overline X-\mu_0}{\sigma/\sqrt{n}}\sim\mathcal{N}(0,1)\ \text{ under }H_0. \] Reject for \(|Z|>z_{1-\alpha/2}\) (two-sided); report \(p=2\{1-\Phi(|z_{\text{obs}}|)\}\).
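
A minimal sketch (assumes numpy/scipy; illustrative data and null value are made up):

```python
# Sketch: two-sided Z-test of H0: mu = mu0 with sigma known.
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 4.9, 5.4, 5.0, 4.7])          # illustrative data (made up)
mu0, sigma = 5.2, 0.3                                  # null value; sigma assumed known
z_obs = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))
p = 2 * (1 - stats.norm.cdf(abs(z_obs)))               # p = 2 {1 - Phi(|z_obs|)}
print(z_obs, p, abs(z_obs) > stats.norm.ppf(0.975))    # reject at alpha = 0.05?
```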

Remark (Visual: \(\alpha\) vs \(\beta\))

Remark (t-tests for a Normal Mean (\(\sigma\) unknown))

Test \(H_0:\mu=\mu_0\) with \[ T=\frac{\overline X-\mu_0}{S/\sqrt{n}}\ \sim\ t_{n-1}\ (H_0). \] Right-tail: reject if \(T>t_{1-\alpha,n-1}\); left-tail: \(T<t_{\alpha,n-1}\); two-sided: \(|T|>t_{1-\alpha/2,n-1}\). p-values use the corresponding \(t_{n-1}\) tails; two-sided \(p=2\{1-F_{t_{n-1}}(|t_{\mathrm{obs}}|)\}\).
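
A sketch of the two-sided test, cross-checked against scipy's built-in version (assumptions as before):

```python
# Sketch: one-sample two-sided t-test of H0: mu = mu0, sigma unknown.
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 4.9, 5.4, 5.0, 4.7])     # illustrative data (made up)
mu0, n = 5.2, len(x)
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p = 2 * (1 - stats.t.cdf(abs(t_obs), df=n - 1))   # two-sided p-value from t_{n-1}
print(t_obs, p)
print(stats.ttest_1samp(x, popmean=mu0))          # scipy reports the same statistic and p-value
```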

Definition (Wald Test (general))

For parameter \(\theta\) with MLE \(\hat\theta\) and \(\widehat{\mathrm{se}}(\hat\theta)\), \[ W=\frac{\hat\theta-\theta_0}{\widehat{\mathrm{se}}(\hat\theta)}\ \approx\ N(0,1). \] Right-tail: reject \(H_0:\theta\le\theta_0\) if \(W>z_{1-\alpha}\); left-tail: \(W<z_{\alpha}\); two-sided: \(|W|>z_{1-\alpha/2}\).
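
A sketch for a one-sided Bernoulli Wald test (assumes numpy/scipy; the plug-in \(\widehat{\mathrm{se}}=\sqrt{\hat\pi(1-\hat\pi)/n}\) and the data are illustration choices):

```python
# Sketch: Wald test of H0: pi <= 0.5 vs H1: pi > 0.5 for Bernoulli data.
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])      # illustrative 0/1 data (made up)
n, pi0 = len(x), 0.5
pi_hat = x.mean()
W = (pi_hat - pi0) / np.sqrt(pi_hat * (1 - pi_hat) / n)
print(W, W > stats.norm.ppf(0.95))                 # right-tail rejection at alpha = 0.05
print(1 - stats.norm.cdf(W))                       # right-tail p-value
```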

Definition (Generalised Likelihood Ratio Test (GLRT))

For \(H_0:\theta\in\Theta_0\) vs \(H_1:\theta\in\Theta\setminus\Theta_0\), \[ \Lambda(x)=\frac{\sup_{\theta\in\Theta_0}L(\theta)}{\sup_{\theta\in\Theta}L(\theta)},\qquad \text{reject for small }\Lambda. \] Critical constant chosen to give size \(\alpha\).
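
A minimal Monte Carlo sketch (assumes numpy; the Bernoulli example and the calibration of the critical constant by simulation are illustration choices, not taken from the notes):

```python
# Sketch: GLRT of H0: pi = 0.5 in the Bernoulli model.  Theta_0 = {0.5}, so the
# numerator is L(0.5); the denominator is L(pi_hat).  The critical constant is
# calibrated by simulation under H0 to give (approximate) size alpha.
import numpy as np

rng = np.random.default_rng(4)
n, pi0, alpha = 30, 0.5, 0.05

def log_lambda(x):
    k, pi_hat = x.sum(), x.mean()
    ll0 = k * np.log(pi0) + (n - k) * np.log(1 - pi0)
    ll1 = k * np.log(pi_hat) if k > 0 else 0.0            # convention 0 log 0 = 0
    ll1 += (n - k) * np.log(1 - pi_hat) if k < n else 0.0
    return ll0 - ll1                                       # log Lambda <= 0

sims = np.array([log_lambda(rng.binomial(1, pi0, size=n)) for _ in range(20_000)])
crit = np.quantile(sims, alpha)                            # reject when log Lambda < crit
x_obs = rng.binomial(1, 0.7, size=n)                       # data drawn from the alternative
print(log_lambda(x_obs) < crit)                            # likely True: reject H0
```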

Definition (Power and Size)

Power \(\pi(\theta)=\mathbb P_\theta(\text{reject }H_0)\). The size of the test is \(\displaystyle \sup_{\theta\in\Theta_0}\pi(\theta)\); the test has significance level \(\alpha\) if its size is at most \(\alpha\).