Statistics
This page pairs well with Probability.
Statistical Inference
Definition
(Random Sample & Model)
Let \(X=(X_1,\ldots,X_n)\) be i.i.d. from a parametric family \(\{F_\theta:\theta\in\Theta\subset\mathbb{R}^p\}\).
The parameter \(\theta\) is unknown; inference uses the randomness of \(X\) to learn about \(\theta\).
Definition
(Estimator / Estimate)
An estimator of \(\theta\) is any statistic \(T_n=T(X)\). The distribution of the (random) statistic \(T_n\) is its sampling distribution; a realised value is an estimate.
Definition
(Bias, Variance & Standard Error)
\(\mathrm{Bias}(T_n)=\mathbb{E}_\theta[T_n]-\theta.\)
\(\mathrm{Var}(T_n)=\mathbb{E}_\theta[(T_n-\mathbb{E}_\theta T_n)^2]\).
The standard error is \(\mathrm{se}(T_n)=\sqrt{\mathrm{Var}(T_n)}\).
Definition
(Mean Squared Error (MSE))
\(\mathrm{MSE}(T_n)=\mathbb{E}_\theta[(T_n-\theta)^2]=\mathrm{Var}(T_n)+\mathrm{Bias}(T_n)^2.\)
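The decomposition is easy to verify by simulation. Below is a minimal sketch using the biased variance estimator (divide by \(n\)) on normal data; the estimator and all constants are illustrative choices, not from this page.
```python
import numpy as np

# Monte Carlo check of MSE(T) = Var(T) + Bias(T)^2 for the
# biased variance estimator T = (1/n) * sum((X_i - Xbar)^2),
# with X_i ~ N(0, sigma^2) and target theta = sigma^2.
rng = np.random.default_rng(0)
n, sigma2, reps = 20, 4.0, 200_000

X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
T = X.var(axis=1)            # ddof=0 divides by n, so E[T] = (n-1)/n * sigma^2

bias = T.mean() - sigma2     # theoretical bias here is -sigma^2/n = -0.2
var = T.var()
mse = np.mean((T - sigma2) ** 2)
print(f"bias^2 + var = {bias**2 + var:.4f},  MSE = {mse:.4f}")
```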
Definition
(Consistency)
\(T_n\) is consistent for \(\theta\) if \(T_n \xrightarrow{P} \theta\) as \(n\to\infty\).
(LLN and Continuous Mapping are the typical engines behind this.)
Definition
(Fisher Information)
For a regular model with pdf/pmf \(f_\theta\), the (per-observation) Fisher information is
\[I(\theta)=\mathrm{Var}_\theta\!\left[\dfrac{\partial}{\partial\theta}\log f_\theta(X)\right] = -\mathbb{E}_\theta\!\left[\dfrac{\partial^2}{\partial\theta^2}\log f_\theta(X)\right].\]
Theorem
(Cramér–Rao Lower Bound (CRLB))
Under regularity conditions, any unbiased estimator \(T_n\) of a scalar function \(g(\theta)\) satisfies
\[
\mathrm{Var}_\theta(T_n)\ \ge\ \frac{\big(g'(\theta)\big)^2}{n\,I(\theta)}.
\]
Hence \(1/(nI(\theta))\) is the information-limited variance bound for unbiased estimation of \(\theta\) (i.e. \(g(\theta)=\theta\)).
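For a concrete instance (an assumed example, not stated above): for Bernoulli(\(\pi\)), \(I(\pi)=1/(\pi(1-\pi))\), and the sample mean is unbiased with variance \(\pi(1-\pi)/n\), so it attains the bound exactly. A quick Monte Carlo sketch:
```python
import numpy as np

# Bernoulli(pi) has I(pi) = 1/(pi*(1-pi)), so the CRLB for unbiased
# estimation of pi is pi*(1-pi)/n; the sample mean attains it.
rng = np.random.default_rng(1)
n, p, reps = 50, 0.3, 200_000

pbar = rng.binomial(n, p, size=reps) / n   # sample means of Bernoulli samples
crlb = p * (1 - p) / n
print(f"Var(pbar) = {pbar.var():.6f},  CRLB = {crlb:.6f}")
```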
Remark
As \(n\) grows, asymptotics (LLN/CLT) typically dominate finite-sample quirks; large-sample normal approximations are the workhorse of practical inference.
Estimation Methods
Definition
(Method of Moments (MoM))
Suppose \(\theta=(\theta_1,\ldots,\theta_p)\) can be written as
\[
\theta_k=\beta_k\big(\mathbb{E}(X),\mathbb{E}(X^2),\ldots,\mathbb{E}(X^K)\big),\quad k=1,\ldots,p,
\]
for continuous \(\beta_k\) and finite moments up to order \(2K\). Replace expectations by sample moments to get \(\widehat\theta_k=\beta_k(\widehat M_1,\ldots,\widehat M_K)\). Under these conditions the MoM estimator is consistent.
Theorem
(MoM Consistency (sketch))
If \(\mathrm{Var}(X^k)<\infty\) for \(k\le K\), then \(\widehat M_k \xrightarrow{P}\mathbb{E}(X^k)\) by the LLN; continuity of \(\beta_k\) gives \(\widehat\theta_k\xrightarrow{P}\theta_k\) by the Continuous Mapping Theorem.
Example
(Rayleigh)
For \(f_\theta(x)=2\theta x e^{-\theta x^2}\,1_{\{x\ge0\}}\), we have \(\mathbb{E}(X)=\frac{\sqrt{\pi}}{2\sqrt{\theta}}\), so the MoM estimator is
\[
\widehat\theta_{\text{MoM}}=\frac{\pi}{4}\,\overline X^{\, -2}.
\]
Its asymptotic distribution follows by the CLT and Delta Method.
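A short simulation sketch of this estimator; it uses the fact that \(X^2\sim\mathrm{Exp}(\theta)\) under this density (the sampling route and constants are illustrative):
```python
import numpy as np

# MoM for f(x) = 2*theta*x*exp(-theta*x^2): since X^2 ~ Exponential(rate=theta),
# we can simulate X = sqrt(E / theta) with E a standard exponential.
rng = np.random.default_rng(2)
theta, n = 2.0, 5_000

x = np.sqrt(rng.standard_exponential(n) / theta)
theta_mom = np.pi / (4 * x.mean() ** 2)
print(f"true theta = {theta}, MoM estimate = {theta_mom:.3f}")
```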
Definition
(Maximum Likelihood (MLE))
For i.i.d. \(X_1,\dots,X_n\), the likelihood is \(L_X(\theta)=\prod_{i=1}^n f_\theta(X_i)\) (discrete: product of pmf’s).
The MLE is \(\widehat\theta=\arg\max_{\theta\in\Theta}L_X(\theta)\).
Equivalently, maximise the log-likelihood \(\ell_X(\theta)=\sum_{i=1}^n\log f_\theta(X_i)\).
Remark
(Score Equation)
Interior maxima solve the score equation: \(\partial \ell_X(\theta)/\partial\theta=0\).
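As an illustration, the score equation can be solved numerically and checked against a closed form. For the Rayleigh density above, \(\ell'(\theta)=n/\theta-\sum X_i^2\), so \(\widehat\theta=n/\sum X_i^2\). A hedged sketch comparing this with a generic optimiser (bounds and constants are illustrative):
```python
import numpy as np
from scipy.optimize import minimize_scalar

# MLE for f(x) = 2*theta*x*exp(-theta*x^2): the score equation
# n/theta - sum(x_i^2) = 0 gives the closed form theta_hat = n / sum(x_i^2).
rng = np.random.default_rng(3)
theta, n = 2.0, 5_000
x = np.sqrt(rng.standard_exponential(n) / theta)   # X^2 ~ Exp(theta)

def neg_loglik(t):
    # -loglik = -(n*log(t) + sum(log(2*x_i)) - t*sum(x_i^2))
    return -(n * np.log(t) + np.sum(np.log(2 * x)) - t * np.sum(x**2))

numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method="bounded").x
closed = n / np.sum(x**2)
print(f"numeric MLE = {numeric:.4f}, closed form = {closed:.4f}")
```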
Theorem
(Asymptotic Normality of the MLE)
Under standard regularity and \(0<I(\theta)<\infty\),
\[
\sqrt{n\,I(\theta)}\big(\widehat\theta-\theta\big)\ \xrightarrow{d}\ \mathcal{N}(0,1),
\]
so \(\widehat\theta \approx \mathcal{N}\!\left(\theta,\,[nI(\theta)]^{-1}\right)\) for large \(n\).
Consequently, MLEs are asymptotically unbiased and efficient (achieve the CRLB).
Example
(Bernoulli(\(\,\pi\)))
\(\ell(\pi)=\big(\sum X_i\big)\log\pi+(n-\sum X_i)\log(1-\pi)\); solving the score equation \(\ell'(\pi)=\frac{\sum X_i}{\pi}-\frac{n-\sum X_i}{1-\pi}=0\) gives \(\widehat\pi=\frac{1}{n}\sum X_i\).
Theorem
(Normal Sample: \(\chi^2\) and \(t\) Facts)
Let \(X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}N(\mu,\sigma^2)\), \(\overline X=\frac1n\sum X_i\), and \(S^2=\frac1{n-1}\sum (X_i-\overline X)^2\). Then
\[
\frac{\overline X-\mu}{S/\sqrt{n}}\ \sim\ t_{n-1},\qquad
\frac{(n-1)S^2}{\sigma^2}\ \sim\ \chi^2_{\,n-1}.
\]
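Both facts are easy to check by simulation; a minimal sketch comparing simulated and theoretical quantiles (all constants illustrative):
```python
import numpy as np
from scipy import stats

# Simulated check: the t-statistic follows t_{n-1} and
# (n-1)S^2/sigma^2 follows chi^2_{n-1} for normal samples.
rng = np.random.default_rng(4)
n, mu, sigma, reps = 10, 5.0, 2.0, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
S2 = X.var(axis=1, ddof=1)

T = (xbar - mu) / np.sqrt(S2 / n)
Q = (n - 1) * S2 / sigma**2

# Compare a few simulated quantiles with the theoretical ones.
print(np.quantile(T, 0.975), stats.t.ppf(0.975, df=n - 1))
print(np.quantile(Q, 0.975), stats.chi2.ppf(0.975, df=n - 1))
```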
Definition
(Fisher Score & Information (sample))
Score: \(S_n(\theta)=\ell'_n(\theta)\). Fisher information: \(I_n(\theta)=-\mathbb E_\theta[\ell''_n(\theta)]\). Properties: \(\mathbb E_\theta[S_n(\theta)]=0\), \(\mathrm{Var}_\theta(S_n(\theta))=I_n(\theta)\).
Theorem
(Asymptotic Variance of the MLE)
For the MLE \(\hat\theta_n\), \(I_n(\theta)\,\mathrm{Var}_\theta(\hat\theta_n)\ \xrightarrow{P}\ 1\). Equivalently, \(\mathrm{se}(\hat\theta_n)\approx I_n(\theta)^{-1/2}\).
Theorem
(Delta Method)
If \(\sqrt{n}\,(\hat\theta-\theta)\Rightarrow N(0,\sigma^2)\) and \(g'(\theta)\neq0\), then
\[
\sqrt{n}\,\frac{g(\hat\theta)-g(\theta)}{\sigma\,g'(\theta)}\ \Rightarrow\ N(0,1).
\]
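Applied to the Rayleigh MoM example above with \(g(m)=\pi/(4m^2)\) and \(m=\mathbb E(X)\): from the moments stated earlier, \(\mathrm{Var}(X)=(1-\pi/4)/\theta\) and \(g'(m)=-\pi/(2m^3)\). A simulation sketch (constants illustrative):
```python
import numpy as np

# Delta method check for the Rayleigh MoM estimator g(Xbar) = pi/(4*Xbar^2):
# with m = sqrt(pi)/(2*sqrt(theta)) and sigma^2 = (1 - pi/4)/theta,
# the asymptotic variance is sigma^2 * g'(m)^2 / n.
rng = np.random.default_rng(5)
theta, n, reps = 2.0, 500, 20_000

x = np.sqrt(rng.standard_exponential((reps, n)) / theta)  # X^2 ~ Exp(theta)
theta_hat = np.pi / (4 * x.mean(axis=1) ** 2)

m = np.sqrt(np.pi) / (2 * np.sqrt(theta))
sigma2 = (1 - np.pi / 4) / theta
gprime = -np.pi / (2 * m**3)
asy_var = sigma2 * gprime**2 / n
print(f"simulated var = {theta_hat.var():.3e}, delta-method var = {asy_var:.3e}")
```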
Theorem
(Extended Delta Method)
If \(g'(\theta)=0\) but \(g^{(k)}(\theta)\neq0\) for the smallest \(k\ge2\), then
\[
\frac{g(\hat\theta)-g(\theta)}{(\sigma/\sqrt{n})^{k}}\ \Rightarrow\ \frac{1}{k!}\,g^{(k)}(\theta)\,Z^{k},\quad Z\sim N(0,1).
\]
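A concrete case (assumed for illustration): \(g(x)=x^2\) at \(\theta=0\), so \(k=2\) and \(g''(0)=2\); then \(n\overline X^2/\sigma^2\Rightarrow Z^2\sim\chi^2_1\). A quick check:
```python
import numpy as np
from scipy import stats

# Extended delta method with g(x) = x^2 at theta = 0 (g'(0)=0, g''(0)=2):
# n * Xbar^2 / sigma^2 should converge to (1/2)*g''(0)*Z^2 = Z^2 ~ chi^2_1.
rng = np.random.default_rng(6)
sigma, n, reps = 1.5, 500, 20_000

xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
W = n * xbar**2 / sigma**2
print(np.quantile(W, 0.95), stats.chi2.ppf(0.95, df=1))  # both near 3.84
```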
Theorem
(Multivariate Fisher Information & Delta)
For \(\boldsymbol\theta\in\mathbb R^{p}\), \(I_n(\boldsymbol\theta)=-\mathbb E[H_{\boldsymbol\theta}\ell_n]\) (Hessian). If \(\hat{\boldsymbol\theta}\) is the MLE and \(g:\mathbb R^{p}\to\mathbb R\) is differentiable, then
\[
\frac{g(\hat{\boldsymbol\theta})-g(\boldsymbol\theta)}{\ \sqrt{\nabla g(\hat{\boldsymbol\theta})^{\!\top} I_n(\hat{\boldsymbol\theta})^{-1}\nabla g(\hat{\boldsymbol\theta})}\ }\ \Rightarrow\ N(0,1).
\]
Confidence Intervals
Definition
(Confidence Interval (CI))
A \(100(1-\alpha)\%\) confidence interval for \(\theta\) is a random interval \(C_\alpha(X)\) such that
\(\mathbb{P}_\theta\!\big(\theta\in C_\alpha(X)\big)=1-\alpha\). The probability is over the sample \(X\); the parameter is fixed.
Definition
(CI via Pivot / Asymptotics)
If a statistic \(T_n=T_n(X,\theta)\) (a pivot) has a known distribution that does not depend on \(\theta\), use its quantiles to solve for \(\theta\).
More generally, if \((\widehat\theta-\theta)/\mathrm{se}(\widehat\theta)\approx\mathcal{N}(0,1)\) for large \(n\), then a two-sided \(100(1-\alpha)\%\) CI is
\[
\widehat\theta\ \pm\ z_{1-\alpha/2}\,\mathrm{se}(\widehat\theta),
\]
with \(z_{1-\alpha/2}\) the \((1-\alpha/2)\)-quantile of the standard normal.
Remark
(Normal Mean (\(\sigma\) known))
If \(X_i\stackrel{\text{iid}}{\sim}\mathcal{N}(\mu,\sigma^2)\) with known \(\sigma\), then
\[
\frac{\overline X-\mu}{\sigma/\sqrt{n}}\sim\mathcal{N}(0,1)\quad\Rightarrow\quad
\mu\in\Big[\ \overline X\ \pm\ z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\ \Big].
\]
(Quantiles \(z_{0.95}=1.645\), \(z_{0.975}=1.96\), \(z_{0.995}=2.576\).)
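A coverage simulation is a useful sanity check: the fraction of intervals containing \(\mu\) should be close to \(1-\alpha\). A minimal sketch (constants illustrative):
```python
import numpy as np

# Coverage check for the known-sigma z-interval at the 95% level:
# the fraction of simulated intervals containing mu should be ~0.95.
rng = np.random.default_rng(7)
mu, sigma, n, reps, z = 3.0, 2.0, 25, 100_000, 1.96

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = z * sigma / np.sqrt(n)
coverage = np.mean((xbar - half <= mu) & (mu <= xbar + half))
print(f"empirical coverage = {coverage:.4f}")
```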
Remark
(Wald CI from MLE)
From \(\widehat\theta\approx\mathcal{N}\!\big(\theta,[nI(\theta)]^{-1}\big)\),
\[
\theta\in\Big[\ \widehat\theta\ \pm\ z_{1-\alpha/2}\ \sqrt{\tfrac{1}{nI(\widehat\theta)}}\ \Big]
\]
is an approximate \(100(1-\alpha)\%\) CI.
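For instance (an assumed Bernoulli example): \(I(\pi)=1/(\pi(1-\pi))\) gives the familiar interval \(\widehat\pi\pm z_{1-\alpha/2}\sqrt{\widehat\pi(1-\widehat\pi)/n}\). A sketch with illustrative counts:
```python
import numpy as np
from scipy import stats

# Wald CI for Bernoulli(pi): I(pi) = 1/(pi*(1-pi)), so the interval is
# pihat +/- z * sqrt(pihat*(1-pihat)/n). The counts below are illustrative.
n, successes, alpha = 200, 68, 0.05
pihat = successes / n
z = stats.norm.ppf(1 - alpha / 2)
half = z * np.sqrt(pihat * (1 - pihat) / n)
print(f"95% Wald CI: [{pihat - half:.3f}, {pihat + half:.3f}]")
```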
(Figure: central \(1-\alpha\) mass under \(N(0,1)\).)
Remark
(t-CI for a Normal Mean (\(\sigma\) unknown))
If \(X_i\sim N(\mu,\sigma^2)\) with unknown \(\sigma\), then a \(100(1-\alpha)\%\) CI for \(\mu\) is
\[
\overline X\ \pm\ t_{n-1,\,1-\alpha/2}\,\frac{S}{\sqrt{n}},
\]
where \(t_{n-1,\,1-\alpha/2}\) is the \((1-\alpha/2)\)-quantile of \(t_{n-1}\).
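A minimal computational sketch using scipy's \(t\) quantiles; the data here are simulated placeholders:
```python
import numpy as np
from scipy import stats

# t-interval for a normal mean with unknown sigma.
rng = np.random.default_rng(8)
x = rng.normal(10.0, 3.0, size=15)   # placeholder data
n, alpha = len(x), 0.05

tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half = tcrit * x.std(ddof=1) / np.sqrt(n)
print(f"95% t-CI: [{x.mean() - half:.3f}, {x.mean() + half:.3f}]")
```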
Hypothesis Testing
Definition
(Hypotheses & Test)
We formalise a claim about \(\theta\) as
\[
H_0:\theta\in\Theta_0 \quad\text{vs}\quad H_1:\theta\in\Theta_1,
\]
choose a statistic \(S(X)\) and reject \(H_0\) when \(S(X)\) falls in a critical region \(C\).
Definition
(Type I Error, Significance)
Type I error: rejecting \(H_0\) when \(H_0\) is true. Its probability \(\alpha=\mathbb{P}_\theta(S\in C)\) (for \(\theta\in\Theta_0\)) is the significance level.
Definition
(Type II Error & Power)
Type II error: failing to reject \(H_0\) when \(H_1\) is true; probability \(\beta(\theta)\) (for \(\theta\in\Theta_1\)).
The power function is \(\pi(\theta)=1-\beta(\theta)=\mathbb{P}_\theta(S\in C)\) for \(\theta\in\Theta_1\).
Definition
(p-value)
For observed \(x\), the p-value is the smallest \(\alpha\) for which \(x\) lies in a level-\(\alpha\) rejection region (equivalently, the tail probability under \(H_0\) of outcomes as or more extreme than \(x\)).
Remark
(Z-test for a Normal Mean (\(\sigma\) known))
With \(X_i\stackrel{\text{iid}}{\sim}\mathcal{N}(\mu,\sigma^2)\), test \(H_0:\mu=\mu_0\) using
\[
Z=\frac{\overline X-\mu_0}{\sigma/\sqrt{n}}\sim\mathcal{N}(0,1)\ \text{ under }H_0.
\]
Reject for \(|Z|>z_{1-\alpha/2}\) (two-sided); report \(p=2\{1-\Phi(|z_{\text{obs}}|)\}\).
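In code the test reduces to one line each for the statistic and the p-value; a sketch with illustrative numbers:
```python
import numpy as np
from scipy import stats

# Two-sided z-test of H0: mu = mu0 with known sigma; numbers are illustrative.
xbar, mu0, sigma, n = 5.4, 5.0, 1.2, 40
z = (xbar - mu0) / (sigma / np.sqrt(n))
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```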
(Figure: Type I error \(\alpha\) vs Type II error \(\beta\).)
Remark
(t-tests for a Normal Mean (\(\sigma\) unknown))
Test \(H_0:\mu=\mu_0\) with
\[
T=\frac{\overline X-\mu_0}{S/\sqrt{n}}\ \sim\ t_{n-1}\ (H_0).
\]
Right-tail: reject if \(T>t_{n-1,\,1-\alpha}\); left-tail: \(T<t_{n-1,\,\alpha}\); two-sided: \(|T|>t_{n-1,\,1-\alpha/2}\).
p-values use the corresponding \(t_{n-1}\) tails; two-sided \(p=2\{1-F_{t_{n-1}}(|t_{\mathrm{obs}}|)\}\).
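A sketch comparing the manual statistic with scipy.stats.ttest_1samp (two-sided by default); the data are simulated placeholders:
```python
import numpy as np
from scipy import stats

# One-sample t-test of H0: mu = mu0, manual computation vs scipy.
rng = np.random.default_rng(9)
x = rng.normal(5.5, 2.0, size=20)   # placeholder data
mu0 = 5.0

t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
p_manual = 2 * (1 - stats.t.cdf(abs(t_manual), df=len(x) - 1))

res = stats.ttest_1samp(x, popmean=mu0)
print(f"manual: t={t_manual:.3f}, p={p_manual:.4f}")
print(f"scipy:  t={res.statistic:.3f}, p={res.pvalue:.4f}")
```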
Definition
(Wald Test (general))
For parameter \(\theta\) with MLE \(\hat\theta\) and \(\widehat{\mathrm{se}}(\hat\theta)\),
\[
W=\frac{\hat\theta-\theta_0}{\widehat{\mathrm{se}}(\hat\theta)}\ \approx\ N(0,1).
\]
Right-tail: reject \(H_0:\theta\le\theta_0\) if \(W>z_{1-\alpha}\); left-tail: \(W<z_{\alpha}\); two-sided: \(|W|>z_{1-\alpha/2}\).
Definition
(Generalised Likelihood Ratio Test (GLRT))
For \(H_0:\theta\in\Theta_0\) vs \(H_1:\theta\in\Theta\setminus\Theta_0\),
\[
\Lambda(x)=\frac{\sup_{\theta\in\Theta_0}L(\theta)}{\sup_{\theta\in\Theta}L(\theta)},\qquad
\text{reject for small }\Lambda.
\]
Critical constant chosen to give size \(\alpha\).
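For a case where \(\Lambda\) is tractable (an assumed example): testing \(H_0:\mu=\mu_0\) in \(N(\mu,\sigma^2)\) with \(\sigma\) known gives \(-2\log\Lambda = n(\overline X-\mu_0)^2/\sigma^2\), which is exactly \(\chi^2_1\) under \(H_0\) (and asymptotically so in general, by Wilks' theorem). A sketch:
```python
import numpy as np
from scipy import stats

# GLRT for H0: mu = mu0 in N(mu, sigma^2) with sigma known:
# -2*log(Lambda) = n*(Xbar - mu0)^2 / sigma^2 ~ chi^2_1 under H0.
rng = np.random.default_rng(10)
mu0, sigma, n, alpha = 0.0, 1.0, 30, 0.05

x = rng.normal(mu0, sigma, size=n)
stat = n * (x.mean() - mu0) ** 2 / sigma**2
crit = stats.chi2.ppf(1 - alpha, df=1)
print(f"-2 log Lambda = {stat:.3f}, reject H0: {stat > crit}")
```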
Definition
(Power and Size)
Power \(\pi(\theta)=\mathbb P_\theta(\text{reject }H_0)\). The size of the test is \(\sup_{\theta\in\Theta_0}\pi(\theta)\); the test has level (significance) \(\alpha\) if its size is at most \(\alpha\).