This page pairs well with Statistics.

Elements of Probability Theory

Definition (Random Experiment, Sample Space, Events)
A random experiment has uncertain outcomes. The sample space $S$ is the set of all possible outcomes. An event $E$ is a subset of $S$. The certain event is $S$; the impossible event is $\varnothing$.

Definition (Probability Measure (Kolmogorov Axioms))

A probability space $(S,\mathcal{F},P)$ consists of a sample space $S$, a $\sigma$-algebra $\mathcal{F}\subseteq 2^S$, and a function $P:\mathcal{F}\to[0,1]$ such that:

  1. $P(\varnothing)=0$, $P(S)=1$;
  2. $P(E)\ge 0$ for all $E\in\mathcal{F}$;
  3. ($\sigma$-additivity) For any countable family of pairwise disjoint events $\{E_i\}$, $P\!\left(\bigcup_i E_i\right)=\sum_i P(E_i)$.

Theorem (Additive Law of Probability)
For any events $E_1,E_2$, \[ P(E_1\cup E_2)=P(E_1)+P(E_2)-P(E_1\cap E_2). \] More generally, \[ P\!\left(\bigcup_{i=1}^n E_i\right)=\sum_i P(E_i)-\sum_{i<j}P(E_i\cap E_j)+\cdots+(-1)^{n-1}P\!\left(\bigcap_{i=1}^n E_i\right). \]

Definition (Conditional Probability)
If $P(F)>0$, define \[ P(E\mid F)=\frac{P(E\cap F)}{P(F)}. \] Equivalently, $P(E\cap F)=P(E\mid F)P(F)$.

Theorem (Multiplicative Law)
For any events $E_1,\dots,E_n$ with $P(E_1\cap\cdots\cap E_{k-1})>0$ for each $k$, \[ P\!\left(\bigcap_{k=1}^n E_k\right)=\prod_{k=1}^n P\!\left(E_k\mid \bigcap_{j<k}E_j\right). \]

Theorem (Law of Total Probability)
If $\{B_i\}_{i\in I}$ is a (finite or countable) partition of $S$ with $P(B_i)>0$, then for any event $A$, \[ P(A)=\sum_{i\in I} P(A\mid B_i)P(B_i). \]

Theorem (Bayes' Rule)
Given a partition $\{B_i\}$ with $P(B_i)>0$ and any event $A$ with $P(A)>0$, \[ P(B_i\mid A)=\frac{P(A\mid B_i)P(B_i)}{\sum_j P(A\mid B_j)P(B_j)}. \]
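A minimal numeric sketch of the last two results (plain Python; the prior and likelihoods below are made-up illustrative values, not part of the theory):

```python
# Hypothetical two-event partition: B1 = "has condition", B2 = "does not".
prior = {"B1": 0.01, "B2": 0.99}          # P(B_i), sums to 1
likelihood = {"B1": 0.95, "B2": 0.05}     # P(A | B_i), where A = "test positive"

# Law of total probability: P(A) = sum_i P(A | B_i) P(B_i)
p_A = sum(likelihood[b] * prior[b] for b in prior)

# Bayes' rule: P(B_i | A) = P(A | B_i) P(B_i) / P(A)
posterior = {b: likelihood[b] * prior[b] / p_A for b in prior}

print(p_A)              # 0.059
print(posterior["B1"])  # ~0.161: a positive result is still most likely a false positive
```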

Definition (Independence)
Events $E$ and $F$ are independent if $P(E\cap F)=P(E)P(F)$ (equivalently, if $P(F)>0$, then $P(E\mid F)=P(E)$). A family $\{E_i\}$ is pairwise independent if each pair is independent; it is mutually independent if \[ P\!\left(\bigcap_{i\in J} E_i\right)=\prod_{i\in J}P(E_i)\quad\text{for every finite set of indices }J. \]

Theorem (Markov's Inequality)
If $X\ge 0$ and $a>0$, then \[ \mathbb P(X\ge a)\le \frac{\mathbb E[X]}{a}. \]

Theorem (Chebyshev's Inequality)
If $\mathbb E[X]=\mu$ and $\operatorname{Var}(X)=\sigma^2<\infty$, then for any $k>0$, \[ \mathbb P\!\big(|X-\mu|>k\sigma\big)\le \frac{1}{k^2}. \]
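A quick Monte Carlo sanity check of both inequalities (a sketch assuming numpy; the exponential distribution is just one convenient non-negative example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # X >= 0 with E[X] = 1, Var(X) = 1

# Markov: P(X >= a) <= E[X] / a
a = 3.0
print((x >= a).mean(), x.mean() / a)             # ~0.05 vs bound 0.333...

# Chebyshev: P(|X - mu| > k*sigma) <= 1/k^2
mu, sigma, k = x.mean(), x.std(), 2.0
print((np.abs(x - mu) > k * sigma).mean(), 1 / k**2)   # ~0.05 vs bound 0.25
```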

Theorem (Jensen's Inequality)
If $h$ is convex and $X$ is integrable, then \[ h\!\big(\mathbb E[X]\big)\le \mathbb E\!\big[h(X)\big]. \] (Equality iff $h$ is affine on the support of $X$ or $X$ is degenerate.)

Random Variables

Definition (Random Variable)
A (real-valued) random variable is a measurable function $X:S\to\mathbb{R}$. We distinguish $X$ (the random variable), a generic realised value $x$, and the observed value $\hat{x}=X(\hat{\omega})$ for a particular outcome $\hat{\omega}\in S$.

Definition (Cumulative Distribution Function (CDF))

The CDF of $X$ is \[ F_X(x)=P(X\le x),\quad x\in\mathbb{R}. \] Key properties:

  1. non-decreasing;
  2. right-continuous;
  3. $\lim_{x\to-\infty}F_X(x)=0$;
  4. $\lim_{x\to+\infty}F_X(x)=1$.

$F_X$ completely characterises the distribution.

Definition (Discrete Random Variable and PMF)
$X$ is discrete if $F_X$ is a step function with (at most) countably many jumps at $\{x_k\}$. The probability mass function (pmf) is $p_X(x)=P(X=x)$, with $\sum_{k}p_X(x_k)=1$ and \[ F_X(x)=\sum_{x_k\le x} p_X(x_k),\qquad p_X(x_k)=F_X(x_k)-F_X(x_k^-). \]

Definition (Continuous Random Variable and PDF)
$X$ is (absolutely) continuous if $F_X$ can be written as the integral of a density; in particular $F_X$ is continuous, so $P(X=x)=0$ for all $x$. A probability density function (pdf) $f_X$ satisfies \[ P(X\in A)=\int_A f_X(x)\,dx,\qquad F_X(x)=\int_{-\infty}^x f_X(y)\,dy,\quad\text{and}\quad f_X(x)=\frac{d}{dx}F_X(x) \] (where differentiable), with $f_X\ge 0$ and $\int_\mathbb{R} f_X=1$.

Definition (Expectation)

For a r.v. $X$ with CDF $F_X$, \[\mathbb{E}[X]=\int_{-\infty}^{\infty} x\,dF_X(x) =\begin{cases}\displaystyle \sum_{x\in S_X} x\,p_X(x), & \text{(discrete)}\\[0.5em] \displaystyle \int_{S_X} x\,f_X(x)\,dx, & \text{(continuous).}\end{cases}\]

Linearity: $\mathbb{E}[aX+b]=a\,\mathbb{E}[X]+b$.

Existence: $\mathbb{E}[X]$ is finite iff $\mathbb{E}[|X|]<\infty$.

Definition (Moments and Variance)
The $k$-th (raw) moment: $\mathbb{E}[X^k]$. The $k$-th central moment: $\mathbb{E}[(X-\mu)^k]$ with $\mu=\mathbb{E}[X]$. The variance is the 2nd central moment: \[ \operatorname{Var}(X)=\mathbb{E}\!\left[(X-\mu)^2\right]\ge 0,\quad \sigma=\sqrt{\operatorname{Var}(X)}. \] Useful identities: \[ \operatorname{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2,\qquad \operatorname{Var}(aX+b)=a^2\,\operatorname{Var}(X). \]

Definition (Moment Generating Function (MGF))
For a r.v.\ $X$, the moment generating function is $M_X(u)=\mathbb E[e^{uX}]$ (where finite near $u=0$). If it exists, then $M_X^{(n)}(0)=\mathbb E[X^n]$. Key facts: (i) Uniqueness — if $M_X=M_Y$ (both finite on a neighbourhood of $0$), then $X\overset{d}=Y$; (ii) Independence — for independent $X,Y$, \[ M_{X+Y}(u)=M_X(u)\,M_Y(u). \]
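The independence property can be checked numerically; a sketch assuming numpy, with two independent Poisson variables chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.poisson(2.0, size=n)
y = rng.poisson(3.0, size=n)     # independent of x

def mgf(sample, u):
    """Monte Carlo estimate of M(u) = E[exp(u X)]."""
    return np.mean(np.exp(u * sample))

u = 0.3
print(mgf(x + y, u))             # ~exp(5 (e^0.3 - 1)), the MGF of Pois(5) at u = 0.3
print(mgf(x, u) * mgf(y, u))     # product of the two MGFs; agrees up to Monte Carlo error
```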

Definition (Quantiles and QQ-Plots)

For continuous $X$, the $100k\%$ quantile is $Q_X(k)=F_X^{-1}(k)$, $0<k<1$. A QQ-plot compares sample order statistics $\{x_{(k)}\}$ to theoretical quantiles $\{F^{-1}(p_k)\}$ (e.g.\ $p_k=(k-0.5)/n$); points should lie roughly on a straight line if the model fits.
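A sketch of the QQ-plot coordinates described above (assuming numpy and scipy; matplotlib would only be needed for the actual plot):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=2.0, size=200)   # data we pretend to be checking

n = sample.size
order_stats = np.sort(sample)                 # x_(1) <= ... <= x_(n)
p = (np.arange(1, n + 1) - 0.5) / n           # plotting positions p_k = (k - 0.5)/n
theoretical = stats.norm.ppf(p)               # F^{-1}(p_k) for the standard normal

# The points (theoretical[k], order_stats[k]) should lie near a straight line;
# its slope and intercept estimate the scale (~2) and location (~10).
slope, intercept = np.polyfit(theoretical, order_stats, 1)
print(slope, intercept)
```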

Theorem (Weak Law of Large Numbers (WLLN))
If $X_1,X_2,\dots$ are i.i.d.\ with $\mathbb E[X_i]=\mu$ and $\operatorname{Var}(X_i)=\sigma^2<\infty$, then for every $\varepsilon>0$, \[ \lim_{n\to\infty}\mathbb P\!\left(\big|\,\overline X_n-\mu\,\big|\ge \varepsilon\right)=0, \] i.e.\ $\overline X_n \xrightarrow{\ \mathbb P\ } \mu$.

Theorem (Strong Law of Large Numbers (SLLN))
Under the same conditions, \[ \mathbb P\!\left(\lim_{n\to\infty}\overline X_n=\mu\right)=1, \] i.e.\ $\overline X_n \xrightarrow{\ \mathrm{a.s.}\ } \mu$.
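A small numerical illustration of the law of large numbers (a sketch assuming numpy; uniform draws chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=100_000)              # i.i.d. with mu = 0.5
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])                    # settles near 0.5 as n grows
```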

Theorem (Central Limit Theorem (CLT))
If $X_1,\dots,X_n$ are i.i.d.\ with mean $\mu$ and variance $0<\sigma^2<\infty$, then \[ Z_n=\sqrt{n}\,\frac{\overline X_n-\mu}{\sigma}\ \xRightarrow[n\to\infty]{d}\ N(0,1), \] equivalently $\ \mathbb P(Z_n\le z)\to \Phi(z)\ $ for all $z\in\mathbb R$.
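A Monte Carlo illustration of the CLT (a sketch assuming numpy/scipy): even for a skewed distribution such as Exp(1), the standardised mean has CDF values close to $\Phi$ at moderate $n$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0                              # Exp(1): mean 1, standard deviation 1

samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

# Compare empirical P(Z_n <= z) with Phi(z) at a few points.
for zz in (-1.0, 0.0, 1.0, 2.0):
    print(zz, (z <= zz).mean(), stats.norm.cdf(zz))
```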

Definition (Modes of Convergence)

\(X_n \xrightarrow{\text{a.s.}} X \;\Rightarrow\; X_n \xrightarrow{P} X \;\Rightarrow\; X_n \xrightarrow{d} X.\) Definitions:

  • In distribution: \(F_{X_n}(x)\to F_X(x)\) at continuity points of \(F_X\).
  • In probability: \(\forall\varepsilon>0,\ \mathbb P(|X_n-X|>\varepsilon)\to0.\)
  • Almost surely: \(\mathbb P(\lim_{n\to\infty}X_n=X)=1.\)

Remark (Normal Approximation to Binomial)
If \(X_n\sim\mathrm{Bin}(n,p)\), then \(\dfrac{X_n-np}{\sqrt{np(1-p)}}\Rightarrow N(0,1)\) as \(n\to\infty\).
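A scipy-based check of the approximation (a sketch; the half-unit continuity correction in the last line is a standard refinement not stated above):

```python
import numpy as np
from scipy import stats

n, p, k = 100, 0.3, 35
exact = stats.binom.cdf(k, n, p)                  # exact P(X <= k)

mean, sd = n * p, np.sqrt(n * p * (1 - p))
approx = stats.norm.cdf((k - mean) / sd)          # plain normal approximation
approx_cc = stats.norm.cdf((k + 0.5 - mean) / sd) # with continuity correction

print(exact, approx, approx_cc)
```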

Random Vectors

Definition (Random Vector and Joint Distribution)
A random vector $(X_1,\dots,X_n):S\to\mathbb{R}^n$ has a joint distribution. For $(X_1,X_2)$, the joint CDF is \[ F_{X_1X_2}(x_1,x_2)=P(X_1\le x_1,\,X_2\le x_2), \] which is non-decreasing in each coordinate, right-continuous, and has limits $F_{X_1X_2}(+\infty,+\infty)=1$, $F_{X_1X_2}(x_1,-\infty)=F_{X_1X_2}(-\infty,x_2)=0$.

Definition (Marginals and Independence)

Marginals: $F_{X_1}(x_1)=\lim_{x_2\to +\infty}F_{X_1X_2}(x_1,x_2)$ (and symmetrically for $X_2$).

Independence: $X_1$ and $X_2$ are independent iff $F_{X_1X_2}(x_1,x_2)=F_{X_1}(x_1)F_{X_2}(x_2)$ for all $(x_1,x_2)$ (equivalently, $p_{X_1X_2}=p_{X_1}p_{X_2}$ in discrete case; $f_{X_1X_2}=f_{X_1}f_{X_2}$ in continuous case).

Definition (Expectation of Functions (LOTUS for vectors))

For $g:\mathbb{R}^2\to\mathbb{R}$, \[\mathbb{E}[g(X_1,X_2)]= \begin{cases}\displaystyle \sum_{x_1,x_2} g(x_1,x_2)\,p_{X_1X_2}(x_1,x_2), & \text{(discrete)}\\[0.5em] \displaystyle \iint g(x_1,x_2)\,f_{X_1X_2}(x_1,x_2)\,dx_2\,dx_1, & \text{(continuous).}\end{cases}\]

In particular, $\mathbb{E}[aX_1+bX_2]=a\,\mathbb{E}[X_1]+b\,\mathbb{E}[X_2]$.

Definition (Covariance and Correlation)
Let $\mu_k=\mathbb{E}[X_k]$. The covariance is \[ \operatorname{Cov}(X_1,X_2)=\mathbb{E}[(X_1-\mu_1)(X_2-\mu_2)], \] and the correlation coefficient is \[ \rho(X_1,X_2)=\frac{\operatorname{Cov}(X_1,X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}}\in[-1,1]. \] If $X_1\perp X_2$, then $\operatorname{Cov}(X_1,X_2)=0$, and \[ \operatorname{Var}(X_1+X_2)=\operatorname{Var}(X_1)+\operatorname{Var}(X_2). \]

Definition (Mean Vector, Covariance Matrix, Linear Transforms)
For $X=(X_1,\dots,X_n)^\top$, define the mean vector \[ \mu_X=\mathbb{E}[X]=\big(\mathbb{E}[X_1],\dots,\mathbb{E}[X_n]\big)^\top, \] and the covariance matrix $\Sigma_X=\operatorname{Cov}(X)$ with $(i,j)$ entry $\operatorname{Cov}(X_i,X_j)$. If $Y=AX+b$ with constant matrix $A$ and vector $b$, then \[ \mathbb{E}[Y]=A\,\mu_X+b,\qquad \operatorname{Cov}(Y)=A\,\Sigma_X\,A^\top. \]
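A simulation sketch of $\mathbb{E}[Y]=A\,\mu_X+b$ and $\operatorname{Cov}(Y)=A\,\Sigma_X\,A^\top$ (assuming numpy; the multivariate normal is just a convenient way to draw vectors with a prescribed mean and covariance):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=500_000)   # each row is one draw of X

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([5.0, -1.0])
Y = X @ A.T + b                                        # Y = A X + b, applied row-wise

print(Y.mean(axis=0), A @ mu + b)        # empirical E[Y] vs A mu + b
print(np.cov(Y, rowvar=False))           # empirical Cov(Y)
print(A @ Sigma @ A.T)                   # A Sigma A^T
```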

Definition (Conditional Density and Conditional Expectation)
For continuous $(X,Y)$ with joint pdf $f_{X,Y}$ and marginal $f_Y(y)>0$, \[ f_{X\mid Y}(x\mid y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}. \] Then \[ \mathbb E[g(X)\mid Y=y]=\int g(x)\,f_{X\mid Y}(x\mid y)\,dx. \] (Discrete analogues use pmfs and sums.)

Definition (Bivariate Normal)
$(X_1,X_2)$ is Gaussian with mean $\mu=(\mu_1,\mu_2)$ and (positive-definite) covariance $V$ if \[ f(x)=\frac{1}{2\pi \sqrt{\det V}} \exp\!\Big(-\tfrac12\,(x-\mu)^\top V^{-1}(x-\mu)\Big). \]

Common Distributions

Definition (Continuous Uniform)
If \(X\sim U[\alpha,\beta]\) with \(\alpha<\beta\), then \[ f_X(x)=\frac{1}{\beta-\alpha}\;1_{\{\alpha\le x\le\beta\}},\qquad F_X(x)=\frac{x-\alpha}{\beta-\alpha}\ (x\in[\alpha,\beta]), \] \[ \mathbb E[X]=\frac{\alpha+\beta}{2},\quad \mathrm{Var}(X)=\frac{(\beta-\alpha)^2}{12}. \] For any \(\alpha\le a<b\le\beta\): \(\mathbb P(a<X<b)=\frac{b-a}{\beta-\alpha}\).

Definition (Standard Uniform \(U[0,1]\))
\[f_U(u)=1_{[0,1]}(u),\quad F_U(u)=\begin{cases}0,&u<0\\u,&0\le u\le 1\\1,&u>1\end{cases}\] \(\mathbb E[U]=\tfrac12,\ \mathrm{Var}(U)=\tfrac1{12}\).

Definition (Bernoulli)

A Bernoulli r.v. \(X\sim \mathrm{Bern}(\pi)\) takes values in \(\{0,1\}\) with \(\mathbb P(X=1)=\pi,\ \mathbb P(X=0)=1-\pi\).

Moments: \(\mathbb E[X]=\pi,\ \mathrm{Var}(X)=\pi(1-\pi)\).

MGF: \(\varphi_X(s)=(1-\pi)+\pi e^{s}\).

Definition (Binomial)
If \(X=\sum_{i=1}^n X_i\) with \(X_i\overset{\text{i.i.d.}}{\sim}\mathrm{Bern}(\pi)\), then \[ X\sim \mathrm{Bin}(n,\pi),\quad \mathbb P(X=x)=\binom{n}{x}\pi^x(1-\pi)^{n-x},\ x=0,\dots,n. \] \(\mathbb E[X]=n\pi,\quad \mathrm{Var}(X)=n\pi(1-\pi),\quad \varphi_X(s)=[(1-\pi)+\pi e^s]^{n}\). (Reproductive property: \(X_1\sim\mathrm{Bin}(n_1,\pi)\), \(X_2\sim\mathrm{Bin}(n_2,\pi)\) independent \(\Rightarrow\ X_1+X_2\sim\mathrm{Bin}(n_1+n_2,\pi)\).)

Definition (Geometric)
Given i.i.d. Bernoulli trials \(X_1,X_2,\dots\) with success prob. \(\pi\), let \(X=\min\{i\in\mathbb N: X_i=1\}\) (number of trials until the first success). Then \(X\sim\mathrm{Geo}(\pi)\) on \(\{1,2,\ldots\}\) with \[ \mathbb P(X=x)=(1-\pi)^{x-1}\pi,\qquad F_X(x)=1-(1-\pi)^{\lfloor x\rfloor}. \] (Memoryless: \(\mathbb P(X=m+n\mid X>m)=(1-\pi)^{n-1}\pi=\mathbb P(X=n)\).)

Definition (Poisson)
\(X\sim \mathrm{Pois}(\lambda)\) on \(\{0,1,\dots\}\) has pmf \[ \mathbb P(X=x)=e^{-\lambda}\frac{\lambda^{x}}{x!}. \] \(\mathbb E[X]=\lambda,\ \mathrm{Var}(X)=\lambda,\ \varphi_X(s)=\exp\{\lambda(e^{s}-1)\}\). Reproductive: if \(X_i\sim \mathrm{Pois}(\lambda_i)\) independent then \(\sum_i X_i\sim \mathrm{Pois}(\sum_i\lambda_i)\). Extension (Poisson process): number of events in \([0,t]\) is \(N_t\sim \mathrm{Pois}(\lambda t)\).

Definition (Exponential)

The interarrival times of a Poisson process with rate \(\lambda>0\) satisfy \(T\sim \mathrm{Exp}(\lambda)\), with cdf/pdf \[ F_T(t)=1-e^{-\lambda t}\ (t\ge0),\qquad f_T(t)=\lambda e^{-\lambda t}\ 1_{\{t\ge 0\}}. \]

Definition (Gamma)
For shape \(k>0\) and rate \(\lambda>0\), \[ X\sim \mathrm{Gamma}(k,\lambda)\quad\Longleftrightarrow\quad f_X(x)=\frac{\lambda e^{-\lambda x}(\lambda x)^{k-1}}{\Gamma(k)}\,1_{\{x>0\}}. \] If \(T_i\overset{\text{i.i.d.}}{\sim}\mathrm{Exp}(\lambda)\) then \(\sum_{i=1}^{n}T_i\sim \mathrm{Gamma}(n,\lambda)\).
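A simulation sketch of the last statement (assuming numpy/scipy; note that numpy's exponential and scipy's gamma take a scale parameter equal to \(1/\lambda\)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
lam, k, reps = 2.0, 5, 200_000

t = rng.exponential(scale=1 / lam, size=(reps, k))   # k i.i.d. Exp(lam) per row
s = t.sum(axis=1)                                    # should be Gamma(k, rate lam)

qs = [0.1, 0.5, 0.9]
print(np.quantile(s, qs))                            # empirical quantiles
print(stats.gamma.ppf(qs, a=k, scale=1 / lam))       # Gamma(k, rate lam) quantiles
```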

Definition (Normal / Gaussian)

Standard normal: \(Z\sim N(0,1)\) with \[ \varphi(z)=\frac{1}{\sqrt{2\pi}}e^{-z^2/2},\quad \Phi(z)=\int_{-\infty}^{z}\varphi(u)\,du. \] General normal: \(X\sim N(\mu,\sigma^2)\) has \[ f(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \] Standardisation: \(Z=(X-\mu)/\sigma\sim N(0,1)\).

Definition (Hypergeometric)
If a population has $N$ items with $m$ successes and $N-m$ failures, and we draw $n$ without replacement, then $X\sim\mathrm{Hyp}(n,m,N)$ with \[ \mathbb P(X=x)=\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}},\qquad x=0,1,\dots,n, \] and $\ \mathbb E[X]=\frac{mn}{N}$.

Definition (Beta)
For $\alpha,\beta>0$, $X\sim\mathrm{Beta}(\alpha,\beta)$ on $(0,1)$ with \[ f(x)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\qquad B(\alpha,\beta)=\int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt. \]

Transformations of Random Variables

Definition (Monotone Transformation (Continuous))
If \(Y=\phi(X)\) with \(\phi\) strictly monotone and differentiable, \[ f_Y(y)=f_X\!\big(\phi^{-1}(y)\big)\,\Big|\big(\phi^{-1}\big)'(y)\Big| = f_X\!\big(\phi^{-1}(y)\big)\,\left|\frac{1}{\phi'(\phi^{-1}(y))}\right|. \] (Derived from \(F_Y\) by treating the increasing and decreasing cases separately and applying the inverse-function theorem.)

Definition (Linear Transformation (Univariate))
If \(Y=aX+b\) with \(a\neq0\) and \(X\) continuous, \[ f_Y(y)=\frac{1}{|a|}\,f_X\!\left(\frac{y-b}{a}\right). \]

Definition (General (Piecewise-Monotone) Transformation)
If \(\phi\) is piecewise monotone with inverse branches \(\{\phi_k^{-1}:J_k\to I_k\}\), \[ f_Y(y)=\sum_k f_X\!\big(\phi_k^{-1}(y)\big)\, \Big|\big(\phi_k^{-1}\big)'(y)\Big|\;1_{\{y\in J_k\}}. \] (Useful e.g. \(Y=Z^2\) with \(Z\sim N(0,1)\Rightarrow Y\sim\chi^2_1\).)

Definition (Change of Variables (Bivariate))

If \(Y=\phi(X)\) where \(X=(X_1,X_2)\), \(Y=(Y_1,Y_2)\), \(\phi\) is one-to-one and differentiable with inverse \(\psi=\phi^{-1}\), then with Jacobian \(J(y)=\det\big[\partial \psi_i/\partial y_j\big]\), \[ f_Y(y)=f_X\!\big(\psi(y)\big)\,|J(y)|. \]

Theorem (Convolution: Sum of Independent Variables)

If \(X_1\perp\!\!\!\perp X_2\):

  • Discrete: \(p_{X_1+X_2}(y)=\displaystyle\sum_{x} p_{X_1}(y-x)\,p_{X_2}(x)\).
  • Continuous: \(f_{X_1+X_2}(y)=\displaystyle\int f_{X_1}(y-x)\,f_{X_2}(x)\,dx\).
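The discrete case is an ordinary convolution of pmf arrays, so it can be checked with numpy directly; the example below also verifies the Poisson reproductive property from earlier (a sketch assuming numpy/scipy):

```python
import numpy as np
from scipy import stats

support = np.arange(0, 40)
p1 = stats.poisson.pmf(support, 2.0)      # pmf of Pois(2)
p2 = stats.poisson.pmf(support, 3.0)      # pmf of Pois(3)

# p_{X1+X2}(y) = sum_x p_{X1}(y - x) p_{X2}(x): a discrete convolution.
p_sum = np.convolve(p1, p2)[: support.size]

# Agrees with Pois(5) on the truncated support up to floating-point error.
print(np.max(np.abs(p_sum - stats.poisson.pmf(support, 5.0))))
```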

Theorem (Min of Independent Exponentials)
If \(T_i\sim\mathrm{Exp}(\lambda_i)\) independent, \[ \min_i T_i\ \sim\ \mathrm{Exp}\!\Big(\sum_i\lambda_i\Big),\qquad \mathbb P\{\operatorname*{argmin}=j\}=\frac{\lambda_j}{\sum_i\lambda_i}. \]
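A simulation sketch of both claims (assuming numpy; note the broadcast of the per-column scales \(1/\lambda_i\)):

```python
import numpy as np

rng = np.random.default_rng(7)
lams = np.array([1.0, 2.0, 3.0])
reps = 500_000

t = rng.exponential(scale=1 / lams, size=(reps, lams.size))  # column i ~ Exp(lam_i)
m = t.min(axis=1)
arg = t.argmin(axis=1)

print(m.mean(), 1 / lams.sum())                     # mean of Exp(sum lam_i) = 1/6
print(np.bincount(arg, minlength=lams.size) / reps,
      lams / lams.sum())                            # ~ (1/6, 2/6, 3/6)
```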

Theorem (Sum of i.i.d. Exponentials \(\Rightarrow\) Gamma)
If \(T_1,\dots,T_n\overset{\text{i.i.d.}}{\sim}\mathrm{Exp}(\lambda)\), then \(\sum_{i=1}^{n}T_i\sim\mathrm{Gamma}(n,\lambda)\).

Definition (Log-Normal via Monotone Transform)
If \(Z\sim N(0,1)\) and \(Y=e^{Z}\), then \[ f_Y(y)=\frac{1}{y\sqrt{2\pi}}\exp\!\left(-\frac{(\log y)^2}{2}\right),\quad y>0. \]

Definition (Chi-Square from Squaring a Normal)
If \(Z\sim N(0,1)\) and \(Y=Z^2\), then \(Y\sim\chi^2_1\) with \[ f_Y(y)=\frac{1}{\sqrt{2\pi\,y}}\,e^{-y/2},\quad y>0. \] (Non-monotone transform handled by two branches \(Z=\pm\sqrt{y}\).)
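A quick check that the two-branch density above agrees with the \(\chi^2_1\) density as implemented in scipy (a sketch):

```python
import numpy as np
from scipy import stats

y = np.linspace(0.1, 5.0, 50)
formula = np.exp(-y / 2) / np.sqrt(2 * np.pi * y)          # density derived above
print(np.max(np.abs(formula - stats.chi2.pdf(y, df=1))))   # ~0 (floating-point error)
```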

Theorem (Probability Integral Transform)
If \(F_X\) is continuous and strictly increasing and \(U=F_X(X)\), then \(U\sim U(0,1)\).
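A sketch of both directions (assuming numpy), using the exponential because its CDF and inverse CDF are explicit; the second half is the inverse-transform sampling recipe implied by the theorem:

```python
import numpy as np

rng = np.random.default_rng(8)
lam = 1.5
x = rng.exponential(scale=1 / lam, size=200_000)

u = 1 - np.exp(-lam * x)                     # U = F_X(X); should be ~ U(0,1)
print(np.quantile(u, [0.25, 0.5, 0.75]))     # ~ (0.25, 0.50, 0.75)

# Inverse transform sampling: F^{-1}(U) = -log(1 - U)/lam reproduces Exp(lam).
u2 = rng.uniform(size=200_000)
x2 = -np.log(1 - u2) / lam
print(x.mean(), x2.mean(), 1 / lam)          # all three ~ 1/1.5
```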

todo

  • Borel-Cantelli Lemma
  • Martingale Convergence Theorem