Probability
This page pairs well with Statistics.
Elements of Probability Theory
A random experiment has uncertain outcomes. The sample space \(S\) is the set of all possible outcomes. An event \(E\) is a subset of \(S\). The certain event is \(S\); the impossible event is \(\varnothing\).
A probability space \((S,\mathcal{F},P)\) consists of a sample space \(S\), a \(\sigma\)-algebra \(\mathcal{F}\subseteq 2^S\), and a function \(P:\mathcal{F}\to[0,1]\) such that:
- \(P(\varnothing)=0\), \(P(S)=1\);
- \(P(E)\ge 0\) for all \(E\in\mathcal{F}\);
- (\(\sigma\)-additivity) For any countable family of pairwise disjoint events \(\{E_i\}\), \(P\!\left(\bigcup_i E_i\right)=\sum_i P(E_i)\).
For any events \(E_1,E_2\), \[ P(E_1\cup E_2)=P(E_1)+P(E_2)-P(E_1\cap E_2). \] More generally, \[ P\!\left(\bigcup_{i=1}^n E_i\right)=\sum_i P(E_i)-\sum_{i<j}P(E_i\cap E_j)+\cdots+(-1)^{n-1}P\!\left(\bigcap_{i=1}^n E_i\right). \]
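A quick numerical sanity check of the two-event formula (Python; the die and the two events are arbitrary illustrative choices):

```python
# Check P(E1 ∪ E2) = P(E1) + P(E2) - P(E1 ∩ E2) on one roll of a fair die.
from fractions import Fraction

S = range(1, 7)                      # sample space
E1 = {x for x in S if x % 2 == 0}    # "even": {2, 4, 6}
E2 = {x for x in S if x >= 4}        # "at least 4": {4, 5, 6}
P = lambda E: Fraction(len(E), 6)    # uniform probability on S

assert P(E1 | E2) == P(E1) + P(E2) - P(E1 & E2) == Fraction(2, 3)
```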
If \(P(F)>0\), define \[ P(E\mid F)=\frac{P(E\cap F)}{P(F)}. \] Equivalently, \(P(E\cap F)=P(E\mid F)P(F)\).
For any events \(E_1,\dots,E_n\) with \(P(E_1\cap\cdots\cap E_{k-1})>0\) for each \(k\), \[ P\!\left(\bigcap_{k=1}^n E_k\right)=\prod_{k=1}^n P\!\left(E_k\mid \bigcap_{j<k}E_j\right). \]
If \(\{B_i\}_{i\in I}\) is a (finite or countable) partition of \(S\) with \(P(B_i)>0\), then for any event \(A\), \[ P(A)=\sum_{i\in I} P(A\mid B_i)P(B_i). \]
Given a partition \(\{B_i\}\) with \(P(B_i)>0\) and any event \(A\) with \(P(A)>0\), \[ P(B_i\mid A)=\frac{P(A\mid B_i)P(B_i)}{\sum_j P(A\mid B_j)P(B_j)}. \]
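A worked example of the last two formulas with the two-element partition \(\{B,B^c\}\) (Python; the prevalence, sensitivity and specificity are made-up illustrative numbers, not data):

```python
# Total probability and Bayes with the partition {B, B^c}; all numbers are
# made-up illustrative values (prevalence, sensitivity, specificity).
p_B = 0.01          # P(B)
sens = 0.95         # P(A | B)
spec = 0.90         # P(not A | not B), so P(A | not B) = 1 - spec

p_A = sens * p_B + (1 - spec) * (1 - p_B)   # law of total probability
p_B_given_A = sens * p_B / p_A              # Bayes' formula

print(f"P(A) = {p_A:.4f}, P(B | A) = {p_B_given_A:.4f}")   # 0.1085, ~0.0876
```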
Events \(E\) and \(F\) are independent if \(P(E\cap F)=P(E)P(F)\) (equivalently, if \(P(F)>0\), then \(P(E\mid F)=P(E)\)). A family \(\{E_i\}\) is pairwise independent if each pair is independent; it is mutually independent if \[ P\!\left(\bigcap_{i\in J} E_i\right)=\prod_{i\in J}P(E_i)\quad\text{for every finite }J. \]
Markov's inequality: if \(X\ge 0\) and \(a>0\), then \[ \mathbb P(X\ge a)\le \frac{\mathbb E[X]}{a}. \]
Chebyshev's inequality: if \(\mathbb E[X]=\mu\) and \(\operatorname{Var}(X)=\sigma^2<\infty\), then for any \(k>0\), \[ \mathbb P\!\big(|X-\mu|>k\sigma\big)\le \frac{1}{k^2}. \]
Jensen's inequality: if \(h\) is convex and \(X\) is integrable, then \[ h\!\big(\mathbb E[X]\big)\le \mathbb E\!\big[h(X)\big]. \] (Equality iff \(h\) is affine on the support of \(X\) or \(X\) is degenerate.)
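A small check of the first two bounds against exact tail probabilities, here for \(X\sim\mathrm{Exp}(1)\) (an arbitrary choice with \(\mu=\sigma=1\)):

```python
# Exact tails of X ~ Exp(1) (mu = sigma = 1) vs the Markov and Chebyshev bounds.
import math

a, k = 3.0, 2.0
print(f"Markov:    P(X >= {a}) = {math.exp(-a):.4f} <= E[X]/a = {1/a:.4f}")
print(f"Chebyshev: P(|X-1| > {k}) = {math.exp(-(1 + k)):.4f} <= 1/k^2 = {1/k**2:.4f}")
```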
Random Variables
A (real-valued) random variable is a measurable function \(X:S\to\mathbb{R}\). We distinguish the random variable \(X\), a generic value \(x\in\mathbb{R}\), and an observed realisation \(\hat{x}=X(\hat{\omega})\) for a particular outcome \(\hat{\omega}\in S\).
The CDF of \(X\) is \[ F_X(x)=P(X\le x),\quad x\in\mathbb{R}. \] Key properties:
- non-decreasing;
- right-continuous;
- \(\lim_{x\to-\infty}F_X(x)=0\);
- \(\lim_{x\to+\infty}F_X(x)=1\).
\(F_X\) completely characterises the distribution.
\(X\) is discrete if \(F_X\) is a step function with (at most) countably many jumps at \(\{x_k\}\). The probability mass function (pmf) is \(p_X(x)=P(X=x)\), with \(\sum_{k}p_X(x_k)=1\) and \[ F_X(x)=\sum_{x_k\le x} p_X(x_k),\qquad p_X(x_k)=F_X(x_k)-F_X(x_k^-). \]
\(X\) is continuous if \(F_X\) is continuous (hence \(P(X=x)=0\) for all \(x\)). A probability density function (pdf) \(f_X\) satisfies \[ P(X\in A)=\int_A f_X(x)\,dx,\qquad F_X(x)=\int_{-\infty}^x f_X(y)\,dy,\quad\text{and}\quad f_X(x)=\frac{d}{dx}F_X(x) \] (where differentiable), with \(f_X\ge 0\) and \(\int_\mathbb{R} f_X=1\).
For a r.v. \(X\) with CDF \(F_X\), \[\mathbb{E}[X]=\int_{-\infty}^{\infty} x\,dF_X(x) =\begin{cases}\displaystyle \sum_{x\in S_X} x\,p_X(x), & \text{(discrete)}\\ \displaystyle \int_{S_X} x\,f_X(x)\,dx, & \text{(continuous).}\end{cases}\]
Linearity: \(\mathbb{E}[aX+b]=a\,\mathbb{E}[X]+b\).
Existence: \(\mathbb{E}[X]\) is finite iff \(\mathbb{E}[|X|]<\infty\).
The $k$-th (raw) moment: \(\mathbb{E}[X^k]\). The $k$-th central moment: \(\mathbb{E}[(X-\mu)^k]\) with \(\mu=\mathbb{E}[X]\). The variance is the 2nd central moment: \[ \operatorname{Var}(X)=\mathbb{E}\!\left[(X-\mu)^2\right]\ge 0,\quad \sigma=\sqrt{\operatorname{Var}(X)}. \] Useful identities: \[ \operatorname{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2,\qquad \operatorname{Var}(aX+b)=a^2\,\operatorname{Var}(X). \]
For a r.v.\ \(X\), the moment generating function is \(M_X(u)=\mathbb E[e^{uX}]\) (where finite near \(u=0\)). If it exists, then \(M_X^{(n)}(0)=\mathbb E[X^n]\). Key facts: (i) Uniqueness — if \(M_X=M_Y\) on a neighbourhood of \(0\), then \(X\overset{d}=Y\); (ii) Independence — for independent \(X,Y\), \[ M_{X+Y}(u)=M_X(u)\,M_Y(u). \]
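A symbolic sketch of the moment property, assuming SymPy is available; the Poisson mgf from the Common Distributions section below is used as the example:

```python
# Differentiate the Poisson(lam) mgf M(u) = exp(lam (e^u - 1)) at u = 0.
import sympy as sp

u, lam = sp.symbols("u lam", positive=True)
M = sp.exp(lam * (sp.exp(u) - 1))

m1 = sp.diff(M, u).subs(u, 0)       # E[X]
m2 = sp.diff(M, u, 2).subs(u, 0)    # E[X^2]

assert sp.simplify(m1 - lam) == 0                 # E[X] = lam
assert sp.simplify(m2 - (lam + lam**2)) == 0      # E[X^2] = lam + lam^2
assert sp.simplify(m2 - m1**2 - lam) == 0         # Var(X) = lam
```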
For \(X\) with continuous, strictly increasing CDF, the \(100k\%\) quantile is \(Q_X(k)=F_X^{-1}(k)\), \(0<k<1\).
A QQ-plot compares sample order statistics \(\{x_{(k)}\}\) to theoretical quantiles \(\{F^{-1}(p_k)\}\) (e.g.\ \(p_k=(k-0.5)/n\)); points should lie roughly on a straight line if the model fits.
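A minimal construction of the QQ-plot points (Python/NumPy/SciPy; the simulated normal sample and the plotting positions \(p_k=(k-0.5)/n\) are illustrative choices):

```python
# QQ-plot points by hand: ordered sample vs theoretical normal quantiles.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.normal(loc=2.0, scale=3.0, size=200))   # ordered sample x_(k)
n = x.size
p = (np.arange(1, n + 1) - 0.5) / n                     # plotting positions p_k
q = stats.norm.ppf(p)                                   # theoretical quantiles

# If the model fits, (q, x) lies roughly on a line with slope sigma, intercept mu.
slope, intercept = np.polyfit(q, x, 1)
print(f"slope ~ {slope:.2f} (sigma = 3), intercept ~ {intercept:.2f} (mu = 2)")
```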
Weak law of large numbers: if \(X_1,X_2,\dots\) are i.i.d.\ with \(\mathbb E[X_i]=\mu\) and \(\operatorname{Var}(X_i)=\sigma^2<\infty\), then the sample mean \(\overline X_n=\frac1n\sum_{i=1}^n X_i\) satisfies, for every \(\varepsilon>0\), \[ \lim_{n\to\infty}\mathbb P\!\left(\big|\,\overline X_n-\mu\,\big|\ge \varepsilon\right)=0, \] i.e.\ \(\overline X_n \xrightarrow{\ \mathbb P\ } \mu\).
Strong law of large numbers: under the same conditions, \[ \mathbb P\!\left(\lim_{n\to\infty}\overline X_n=\mu\right)=1, \] i.e.\ \(\overline X_n \xrightarrow{\ \mathrm{a.s.}\ } \mu\).
Central limit theorem: if \(X_1,\dots,X_n\) are i.i.d.\ with mean \(\mu\) and variance \(0<\sigma^2<\infty\), then \[Z_n=\sqrt{n}\,\frac{\overline X_n-\mu}{\sigma}\ \xRightarrow[n\to\infty]{d}\ N(0,1),\] equivalently \(\ \mathbb P(Z_n\le z)\to \Phi(z)\ \) for all \(z\in\mathbb R\).
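An empirical illustration (Python/NumPy/SciPy; the sample size, number of replications and the Exp(1) population are arbitrary choices):

```python
# Standardised means of i.i.d. Exp(1) samples compared with Phi.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0                              # mean and sd of Exp(1)

samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

for t in (-1.0, 0.0, 1.0):
    print(f"P(Z_n <= {t:+.0f}): empirical {np.mean(z <= t):.3f}, "
          f"Phi {stats.norm.cdf(t):.3f}")
```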
\(X_n \xrightarrow{\text{a.s.}} X \;\Rightarrow\; X_n \xrightarrow{P} X \;\Rightarrow\; X_n \xrightarrow{d} X.\) Definitions:
- In distribution: \(F_{X_n}(x)\to F_X(x)\) at continuity points of \(F_X\).
- In probability: \(\forall\varepsilon>0,\ \mathbb P(|X_n-X|>\varepsilon)\to0.\)
- Almost surely: \(\mathbb P(\lim_{n\to\infty}X_n=X)=1.\)
Normal approximation to the binomial (de Moivre–Laplace): if \(X\sim\mathrm{Bin}(n,p)\), then \(\dfrac{X-np}{\sqrt{np(1-p)}}\xrightarrow{\ d\ } N(0,1)\) as \(n\to\infty\).
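A quick comparison of the exact binomial CDF with the normal approximation, using a continuity correction (Python/SciPy; the values of \(n\), \(p\), \(k\) are illustrative):

```python
# Exact Bin(n, p) CDF vs the normal approximation with continuity correction.
from scipy import stats

n, p, k = 100, 0.3, 35
exact = stats.binom.cdf(k, n, p)
approx = stats.norm.cdf((k + 0.5 - n * p) / (n * p * (1 - p)) ** 0.5)
print(f"P(X <= {k}): exact {exact:.4f}, normal approx {approx:.4f}")
```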
Random Vectors
A random vector \((X_1,\dots,X_n):S\to\mathbb{R}^n\) has a joint distribution. For \((X_1,X_2)\), the joint CDF is \[ F_{X_1X_2}(x_1,x_2)=P(X_1\le x_1,\,X_2\le x_2), \] which is non-decreasing in each coordinate, right-continuous, and has limits \(F_{X_1X_2}(+\infty,+\infty)=1\), \(F_{X_1X_2}(x_1,-\infty)=F_{X_1X_2}(-\infty,x_2)=0\).
Marginals: \(F_{X_1}(x_1)=\lim_{x_2\to +\infty}F_{X_1X_2}(x_1,x_2)\) (and symmetrically for \(X_2\)).
Independence: \(X_1\) and \(X_2\) are independent iff \(F_{X_1X_2}(x_1,x_2)=F_{X_1}(x_1)F_{X_2}(x_2)\) for all \((x_1,x_2)\) (equivalently, \(p_{X_1X_2}=p_{X_1}p_{X_2}\) in discrete case; \(f_{X_1X_2}=f_{X_1}f_{X_2}\) in continuous case).
For \(g:\mathbb{R}^2\to\mathbb{R}\), \[\mathbb{E}[g(X_1,X_2)]= \begin{cases}\displaystyle \sum_{x_1,x_2} g(x_1,x_2)\,p_{X_1X_2}(x_1,x_2), & \text{(discrete)}\\ \displaystyle \iint g(x_1,x_2)\,f_{X_1X_2}(x_1,x_2)\,dx_2\,dx_1, & \text{(continuous).}\end{cases}\]
In particular, \(\mathbb{E}[aX_1+bX_2]=a\,\mathbb{E}[X_1]+b\,\mathbb{E}[X_2]\).
Let \(\mu_k=\mathbb{E}[X_k]\). The covariance is \[ \operatorname{Cov}(X_1,X_2)=\mathbb{E}[(X_1-\mu_1)(X_2-\mu_2)], \] and the correlation coefficient is \[ \rho(X_1,X_2)=\frac{\operatorname{Cov}(X_1,X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}}\in[-1,1]. \] If \(X_1\perp X_2\), then \(\operatorname{Cov}(X_1,X_2)=0\), and \[ \operatorname{Var}(X_1+X_2)=\operatorname{Var}(X_1)+\operatorname{Var}(X_2). \]
For \(X=(X_1,\dots,X_n)^\top\), define the mean vector \[ \mu_X=\mathbb{E}[X]=\big(\mathbb{E}[X_1],\dots,\mathbb{E}[X_n]\big)^\top, \] and the covariance matrix \(\Sigma_X=\operatorname{Cov}(X)\) with \((i,j)\) entry \(\operatorname{Cov}(X_i,X_j)\). If \(Y=AX+b\) with constant matrix \(A\) and vector \(b\), then \[ \mathbb{E}[Y]=A\,\mu_X+b,\qquad \operatorname{Cov}(Y)=A\,\Sigma_X\,A^\top. \]
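A simulation check of these two identities (Python/NumPy; \(\mu_X\), \(\Sigma_X\), \(A\), \(b\) are arbitrary choices, and the Gaussian sampler is used only for convenience):

```python
# Simulation check of E[AX + b] = A mu_X + b and Cov(AX + b) = A Sigma_X A^T.
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0]])
b = np.array([3.0, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are draws of X
Y = X @ A.T + b                                       # Y = AX + b, row-wise

print("empirical E[Y]   :", Y.mean(axis=0))
print("A mu + b         :", A @ mu + b)
print("empirical Cov(Y) :\n", np.cov(Y, rowvar=False))
print("A Sigma A^T      :\n", A @ Sigma @ A.T)
```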
For continuous \((X,Y)\) with joint pdf \(f_{X,Y}\) and marginal \(f_Y(y)>0\), \[ f_{X\mid Y}(x\mid y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}. \] Then \[ \mathbb E[g(X)\mid Y=y]=\int g(x)\,f_{X\mid Y}(x\mid y)\,dx. \] (Discrete analogues use pmfs and sums.)
\((X_1,X_2)\) is (jointly) Gaussian with mean \(\mu=(\mu_1,\mu_2)\) and positive-definite covariance matrix \(V\) if, for \(x\in\mathbb R^2\), \[ f(x)=\frac{1}{2\pi \sqrt{\det V}} \exp\!\Big(-\tfrac12\,(x-\mu)^\top V^{-1}(x-\mu)\Big). \]
Common Distributions
If \(X\sim U[\alpha,\beta]\) with \(\alpha<\beta\), then \[ f_X(x)=\frac{1}{\beta-\alpha}\;1_{\{\alpha\le x\le\beta\}},\qquad F_X(x)=\frac{x-\alpha}{\beta-\alpha}\ (x\in[\alpha,\beta]), \] \[ \mathbb E[X]=\frac{\alpha+\beta}{2},\quad \mathrm{Var}(X)=\frac{(\beta-\alpha)^2}{12}. \] For any \(\alpha\le a<b\le\beta\): \(\mathbb P(a<X<b)=\frac{b-a}{\beta-\alpha}\).
In particular, the standard uniform \(U\sim U[0,1]\) has \[f_U(u)=1_{[0,1]}(u),\quad F_U(u)=\begin{cases}0,&u<0\\u,&0\le u\le 1\\1,&u>1\end{cases}\] with \(\mathbb E[U]=\tfrac12,\ \mathrm{Var}(U)=\tfrac1{12}\).
A Bernoulli r.v. \(X\sim \mathrm{Bern}(\pi)\) takes values in \(\{0,1\}\) with \(\mathbb P(X=1)=\pi,\ \mathbb P(X=0)=1-\pi\).
Moments: \(\mathbb E[X]=\pi,\ \mathrm{Var}(X)=\pi(1-\pi)\).
mgf: \(\varphi_X(s)=(1-\pi)+\pi e^{s}\).
If \(X=\sum_{i=1}^n X_i\) with \(X_i\overset{\text{i.i.d.}}{\sim}\mathrm{Bern}(\pi)\), then \[ X\sim \mathrm{Bin}(n,\pi),\quad \mathbb P(X=x)=\binom{n}{x}\pi^x(1-\pi)^{n-x},\ x=0,\dots,n. \] \(\mathbb E[X]=n\pi,\quad \mathrm{Var}(X)=n\pi(1-\pi),\quad \varphi_X(s)=[(1-\pi)+\pi e^s]^{n}\). (Reproductive property: \(X_1\sim\mathrm{Bin}(n_1,\pi)\), \(X_2\sim\mathrm{Bin}(n_2,\pi)\) independent \(\Rightarrow\ X_1+X_2\sim\mathrm{Bin}(n_1+n_2,\pi)\).)
Given i.i.d.\ Bernoulli(\(\pi\)) trials \(X_1,X_2,\dots\), let \(X=\min\{i\in\mathbb N: X_i=1\}\) (the number of trials up to and including the first success). Then \(X\sim\mathrm{Geo}(\pi)\) on \(\{1,2,\ldots\}\) with \[ \mathbb P(X=x)=(1-\pi)^{x-1}\pi,\qquad F_X(x)=1-(1-\pi)^{\lfloor x\rfloor}. \] (Memoryless: \(\mathbb P(X=m+n\mid X>m)=(1-\pi)^{n-1}\pi=\mathbb P(X=n)\).)
\(X\sim \mathrm{Pois}(\lambda)\) on \(\{0,1,\dots\}\) has pmf \[ \mathbb P(X=x)=e^{-\lambda}\frac{\lambda^{x}}{x!}. \] \(\mathbb E[X]=\lambda,\ \mathrm{Var}(X)=\lambda,\ \varphi_X(s)=\exp\{\lambda(e^{s}-1)\}\). Reproductive: if \(X_i\sim \mathrm{Pois}(\lambda_i)\) independent then \(\sum_i X_i\sim \mathrm{Pois}(\sum_i\lambda_i)\). Extension (Poisson process): number of events in \([0,t]\) is \(N_t\sim \mathrm{Pois}(\lambda t)\).
The interarrival times of a Poisson process with rate \(\lambda>0\) are exponential: \(T\sim \mathrm{Exp}(\lambda)\), with CDF/pdf \[ F_T(t)=1-e^{-\lambda t}\ (t\ge0),\qquad f_T(t)=\lambda e^{-\lambda t}\ 1_{\{t\ge 0\}}. \]
For shape \(k>0\) and rate \(\lambda>0\), \[ X\sim \mathrm{Gamma}(k,\lambda)\quad\Longleftrightarrow\quad f_X(x)=\frac{\lambda e^{-\lambda x}(\lambda x)^{k-1}}{\Gamma(k)}\,1_{\{x>0\}}. \] If \(T_i\overset{\text{i.i.d.}}{\sim}\mathrm{Exp}(\lambda)\) then \(\sum_{i=1}^{n}T_i\sim \mathrm{Gamma}(n,\lambda)\).
Standard normal: \(Z\sim N(0,1)\) with \[ \varphi(z)=\frac{1}{\sqrt{2\pi}}e^{-z^2/2},\quad \Phi(z)=\int_{-\infty}^{z}\varphi(u)\,du. \] General normal: \(X\sim N(\mu,\sigma^2)\) has \[ f(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \] Standardisation: \(Z=(X-\mu)/\sigma\sim N(0,1)\).
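Standardisation in practice (Python/SciPy; the parameters and interval are illustrative):

```python
# P(a < X < b) for X ~ N(mu, sigma^2) via standardisation:
# Phi((b - mu)/sigma) - Phi((a - mu)/sigma).
from scipy.stats import norm

mu, sigma = 10.0, 2.0
a, b = 9.0, 13.0
p = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
print(f"P({a} < X < {b}) = {p:.4f}")
# equivalently: norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
```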
If a population has \(N\) items with \(m\) successes and \(N-m\) failures, and we draw \(n\) without replacement, then \(X\sim\mathrm{Hyp}(n,m,N)\) with \[ \mathbb P(X=x)=\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}},\qquad x=0,1,\dots,n, \] and \(\ \mathbb E[X]=\frac{mn}{N}\).
For \(\alpha,\beta>0\), \(X\sim\mathrm{Beta}(\alpha,\beta)\) on \((0,1)\) with \[ f(x)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\qquad B(\alpha,\beta)=\int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt. \]
Transformations of Random Variables
If \(Y=\phi(X)\) with \(\phi\) strictly monotone and differentiable, \[ f_Y(y)=f_X\!\big(\phi^{-1}(y)\big)\,\Big|\big(\phi^{-1}\big)'(y)\Big| = f_X\!\big(\phi^{-1}(y)\big)\,\left|\frac{1}{\phi'(\phi^{-1}(y))}\right|. \] (Derived from \(F_Y\) by treating the increasing and decreasing cases separately and applying the inverse-function theorem.)
If \(Y=aX+b\) with \(a\neq0\) and \(X\) continuous, \[ f_Y(y)=\frac{1}{|a|}\,f_X\!\left(\frac{y-b}{a}\right). \]
If \(\phi\) is piecewise monotone with inverse branches \(\{\phi_k^{-1}:J_k\to I_k\}\), \[ f_Y(y)=\sum_k f_X\!\big(\phi_k^{-1}(y)\big)\, \Big|\big(\phi_k^{-1}\big)'(y)\Big|\;1_{\{y\in J_k\}}. \] (Useful e.g.\ for \(Y=Z^2\) with \(Z\sim N(0,1)\Rightarrow Y\sim\chi^2_1\).)
If \(Y=\phi(X)\) where \(X=(X_1,X_2)\), \(Y=(Y_1,Y_2)\), \(\phi\) is one-to-one and differentiable with inverse \(\psi=\phi^{-1}\), then with Jacobian \(J(y)=\det\big[\partial \psi_i/\partial y_j\big]\), \[ f_Y(y)=f_X\!\big(\psi(y)\big)\,|J(y)|. \]
If \(X_1\perp\!\!\!\perp X_2\):
- Discrete: \(p_{X_1+X_2}(y)=\displaystyle\sum_{x} p_{X_1}(y-x)\,p_{X_2}(x)\).
- Continuous: \(f_{X_1+X_2}(y)=\displaystyle\int f_{X_1}(y-x)\,f_{X_2}(x)\,dx\).
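A numerical check of the discrete convolution formula (Python/NumPy/SciPy; the rates are arbitrary): convolving two independent Poisson pmfs should reproduce the \(\mathrm{Pois}(\lambda_1+\lambda_2)\) pmf, by the reproductive property above.

```python
# Convolve two independent Poisson pmfs and compare with Pois(lam1 + lam2).
import numpy as np
from scipy import stats

lam1, lam2 = 2.0, 3.5
x = np.arange(60)                               # truncated support; tail mass negligible
p1 = stats.poisson.pmf(x, lam1)
p2 = stats.poisson.pmf(x, lam2)

p_sum = np.convolve(p1, p2)[: x.size]           # pmf of X1 + X2 on 0..59
print(np.allclose(p_sum, stats.poisson.pmf(x, lam1 + lam2)))   # True
```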
If \(T_i\sim\mathrm{Exp}(\lambda_i)\) independent, \[ \min_i T_i\ \sim\ \mathrm{Exp}\!\Big(\sum_i\lambda_i\Big),\qquad \mathbb P\!\Big(T_j=\min_i T_i\Big)=\frac{\lambda_j}{\sum_i\lambda_i}. \]
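A simulation check of both facts (Python/NumPy; the rates and sample size are arbitrary):

```python
# Check: min of independent exponentials is Exp(sum of rates), and the
# index achieving the minimum has probability lam_j / sum(lam).
import numpy as np

rng = np.random.default_rng(4)
rates = np.array([0.5, 1.0, 2.5])
T = rng.exponential(scale=1.0 / rates, size=(100_000, rates.size))

m = T.min(axis=1)
print(f"E[min T] ~ {m.mean():.4f}  (theory {1 / rates.sum():.4f})")
print("argmin freq:", np.bincount(T.argmin(axis=1)) / T.shape[0],
      " theory:", rates / rates.sum())
```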
If \(T_1,\dots,T_n\overset{\text{i.i.d.}}{\sim}\mathrm{Exp}(\lambda)\), then \(\sum_{i=1}^{n}T_i\sim\mathrm{Gamma}(n,\lambda)\).
If \(Z\sim N(0,1)\) and \(Y=e^{Z}\), then \[ f_Y(y)=\frac{1}{y\sqrt{2\pi}}\exp\!\left(-\frac{(\log y)^2}{2}\right),\quad y>0. \]
If \(Z\sim N(0,1)\) and \(Y=Z^2\), then \(Y\sim\chi^2_1\) with \[ f_Y(y)=\frac{1}{\sqrt{2\pi\,y}}\,e^{-y/2},\quad y>0. \] (Non-monotone transform handled by two branches \(Z=\pm\sqrt{y}\).)
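A numerical check that the two-branch formula coincides with the \(\chi^2_1\) density (Python/SciPy):

```python
# The two-branch density of Y = Z^2 should equal the chi-squared(1) pdf.
import numpy as np
from scipy import stats

y = np.linspace(0.05, 5.0, 100)
f_two_branch = np.exp(-y / 2) / np.sqrt(2 * np.pi * y)
print(np.allclose(f_two_branch, stats.chi2.pdf(y, df=1)))   # True
```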
If \(F_X\) is strictly increasing and \(U=F_X(X)\), then \(U\sim U(0,1)\).
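The converse is inverse-transform sampling: if \(U\sim U(0,1)\), then \(F_X^{-1}(U)\) has CDF \(F_X\). A minimal sketch for \(\mathrm{Exp}(\lambda)\) (Python/NumPy; the rate and seed are arbitrary):

```python
# Inverse-transform sampling for Exp(lam): F^{-1}(u) = -log(1 - u) / lam.
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0
u = rng.uniform(size=100_000)       # U(0,1) draws
x = -np.log(1.0 - u) / lam          # should be Exp(lam) distributed

print(f"mean ~ {x.mean():.3f} (theory {1/lam:.3f}), "
      f"var ~ {x.var():.3f} (theory {1/lam**2:.3f})")
```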
Table of Distributions
| Distribution | mass/density function | \[S_X\] | \[\mathbb{E}(X)\] | \[\mathrm{Var}(X)\] | \[\phi_X(s)\] |
|---|---|---|---|---|---|
| Bernoulli \[\mathrm{Bern}(\pi)\] | \[P(X=1)=\pi\\ P(X=0)=1-\pi\] | \[\{0,1\}\] | \[\pi\] | \[\pi(1-\pi)\] | \[(1-\pi)+\pi e^{s}\] |
| Binomial \[\mathrm{Bin}(n,\pi)\] | \[p_X(x)=\binom{n}{x}\pi^{x}(1-\pi)^{n-x}\] | \[\{0,1,\dots,n\}\] | \[n\pi\] | \[n\pi(1-\pi)\] | \[(1-\pi+\pi e^{s})^{n}\] |
| Geometric \[\mathrm{Geo}(\pi)\] | \[p_X(x)=\pi(1-\pi)^{x-1}\] | \[\{1,2,\dots\}\] | \[\pi^{-1}\] | \[(1-\pi)\pi^{-2}\] | \[\frac{\pi}{e^{-s}-1+\pi}\] |
| Poisson \[\mathcal{P}(\lambda)\] | \[p_X(x)=e^{-\lambda}\lambda^{x}/x!\] | \[\{0,1,\dots\}\] | \[\lambda\] | \[\lambda\] | \[\exp\{\lambda(e^{s}-1)\}\] |
| Uniform \[U[\alpha,\beta]\] | \[f_X(x)=(\beta-\alpha)^{-1}\] | \[[\alpha,\beta]\] | \[\frac{1}{2}(\alpha+\beta)\] | \[\frac{1}{12}(\beta-\alpha)^2\] | \[\frac{e^{\beta s}-e^{\alpha s}}{s(\beta-\alpha)}\] |
| Exponential \[\mathrm{Exp}(\lambda)\] | \[f_X(x)=\lambda e^{-\lambda x}\] | \[[0,\infty)\] | \[\lambda^{-1}\] | \[\lambda^{-2}\] | \[\frac{\lambda}{\lambda-s}\] |
| Gaussian \[\mathcal{N}(\mu,\sigma^{2})\] | \[f_X(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^{2}}\right\}\] | \[\mathbb{R}\] | \[\mu\] | \[\sigma^{2}\] | \[e^{\mu s+\frac{1}{2}\sigma^{2}s^{2}}\] |
| Gamma \[\Gamma(\alpha,\lambda)\] | \[f_X(x)=\frac{1}{\Gamma(\alpha)}\lambda^{\alpha}x^{\alpha-1}e^{-\lambda x}\] | \[[0,\infty)\] | \[\alpha\lambda^{-1}\] | \[\alpha\lambda^{-2}\] | \[\left(\frac{\lambda}{\lambda-s}\right)^{\alpha}\] |