23 Birthday Problems
Solutions
Q1
a)
We use the update rule \[x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\]
to produce \(x_1 = 1.5\) and \(x_2 \approx 1.417\).
Note that to 3 decimal places, \(\sqrt{2} \approx 1.414\).
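A minimal sketch of this iteration in Python (assuming the starting guess \(x_0 = 1\), which is not stated above):
def newton_sqrt2(x, iterations):
    # one Newton-Raphson step per iteration for f(x) = x^2 - 2, f'(x) = 2x
    for _ in range(iterations):
        x = x - (x**2 - 2) / (2 * x)
    return x

print(newton_sqrt2(1.0, 1))  # 1.5
print(newton_sqrt2(1.0, 2))  # 1.4166...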
b)
We can adjust our update rule to find the roots of the derivative function, i.e. its local optima: \[x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}\]
c)
Advantage: quadratic convergence.
Disadvantage: computing (and inverting) the second derivative can be expensive or difficult, for example when the Hessian is a large, sparse matrix; furthermore, an analytic expression for it may not exist.
Q2
a)
Let "SB" denote the phrase "share birthdays": \[\begin{align*} \mathcal{P}(\geq 2~\mathrm{SB}) &= \mathcal{P}(3~\mathrm{SB}) + \mathcal{P}(4~\mathrm{SB}) + \ldots + \mathcal{P}(N~\mathrm{SB})\\ &= 1 - \mathcal{P}(\text{nobody }\mathrm{SB})\\ \mathcal{P}(\text{nobody }\mathrm{SB}) &= \frac{365}{365}\times\frac{364}{365}\times\ldots\frac{365-n+1}{365}\\ &= \frac{365!}{365^n(365-n)!} \end{align*}\]
However, there is no obvious closed-form expression for the smallest \(n\) satisfying \(\frac{365!}{365^n(365-n)!} < \frac{1}{2}\).
We can massage the equivalent LHS expression \(\frac{365}{365}\times\frac{364}{365}\times\ldots\times\frac{365-n+1}{365}\) into \[\prod_{k=0}^{n-1}\frac{365-k}{365}\]
which can then be expressed in code:
def no_shared_birthday_prob(n):
    # probability that all n people have distinct birthdays
    probability = 1.0
    for k in range(n):
        probability *= (365 - k) / 365
    return probability

threshold = 1 / 2
n = 1
while no_shared_birthday_prob(n) >= threshold:
    n += 1
print(f"thus the smallest n such that p(no shared birthday) < 1/2 is {n}")
thus the smallest n such that p(no shared birthday) < 1/2 is 23
b)
By the pigeonhole principle, it would be 366 people for a regular year, and 367 for a leap year.
Q3
a)
This matrix is non-square and thus does not possess an inverse.
b)
\[\begin{align*} \begin{vmatrix}0&2&4\\2&4&6\\4&6&8\end{vmatrix} &= 0\begin{vmatrix}4&6\\6&8\end{vmatrix} - 2\begin{vmatrix}2&6\\4&8\end{vmatrix} + 4\begin{vmatrix}2&4\\4&6\end{vmatrix} \\ &= -2(16-24) + 4(12-16)\\ &= 16 - 16 \\ &= 0 \end{align*}\]
Thus this matrix does not have an inverse.
c)
Since the determinant of a triangular matrix is equal to the product of the diagonal elements, we have a non-zero determinant and can perform the row-reduction on \(A|I\).
from sympy import Matrix

# augmented matrix [A | I]
m = Matrix([
    [4, 3, 2, 1, 1, 0, 0, 0],
    [0, 3, 2, 1, 0, 1, 0, 0],
    [0, 0, 2, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 1]])
print(m.rref())
Reading off the right-hand block of the reduced matrix \([\,I \mid A^{-1}\,]\): \[ A^{-1} = \begin{bmatrix}\cfrac{1}{4}&-\cfrac{1}{4}&0&0\\0&\cfrac{1}{3}&-\cfrac{1}{3}&0\\0&0&\cfrac{1}{2}&-\cfrac{1}{2}\\0&0&0&1\end{bmatrix}\]
Q4
a)
\[3\]
b)
\[6\]
Q5
a)
The trace of a matrix is equal to the sum of the eigenvalues.
b)
The determinant of a matrix is equal to the product of its eigenvalues.
c)
\[\lambda_1 = -4, \lambda_2 = 6\]
d)
Eigendecompositions do not always exist, but Singular Value Decompositions do.
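As a quick numerical illustration of a) and b), using an assumed example matrix whose eigenvalues happen to be \(-4\) and \(6\) (not necessarily the matrix from the question):
import numpy as np

A = np.array([[1.0, 5.0],
              [5.0, 1.0]])               # eigenvalues are 6 and -4
eigvals = np.linalg.eigvals(A)
print(np.trace(A), eigvals.sum())        # 2.0  2.0
print(np.linalg.det(A), eigvals.prod())  # both -24 (up to floating point)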
Q6
\[\begin{align*} A &= zz^T\\ A^T&=(zz^T)^T\\ &=zz^T\\ &=A\end{align*}\]
Furthermore, \[\begin{align*} \forall x\in \mathbb{R}^n,~ x^Tzz^Tx &= (z^Tx)^T(z^Tx)\\ &=\|z^Tx\|^2_2\\ &\geq 0\qquad\quad\text{by definition of the L2 norm.}\end{align*}\]
Q7
a)
Separating and integrating: \[\begin{align*} \int\frac{\mathrm{d}y}{1-y} &= \int \cos(x)\, \mathrm{d}x\\ -\ln|1-y| &= \sin(x) + C\\ \implies y &= 1 - Ae^{-\sin(x)},\end{align*}\] and the initial condition fixes the constant at \(A = 2e\), giving \[y = 1 - 2e^{\,1-\sin(x)}.\]
b)
Solving for the particular and homogeneous solutions and combining them: \[\large y = \underbrace{Ae^{-7t} + Bte^{-7t}}_{y_h} + \underbrace{\frac{1}{64}e^t}_{y_p}\]
Q8
a)
\[e^z = \sum_{n=0}^\infty \frac{z^n}{n!} = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + \ldots \]
b)
\[\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \ldots\]
c)
\[\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \ldots\]
d)
\[\sinh(x) = x + \frac{x^3}{3!} + \frac{x^5}{5!} + \frac{x^7}{7!} + \ldots \]
e)
\[\cosh(x) = 1 + \frac{x^2}{2!} + \frac{x^4}{4!} + \frac{x^6}{6!} + \ldots\]
f)
\[\frac{1}{1-x} = 1 + x + x^2 + x^3 + \ldots\]
g)
\[\frac{1}{1+x} = 1 - x + x^2 - x^3 + \ldots\]
Q9
\[\large a_n = \frac{2^{(n+1)}}{(n+2)!}\]
Q10
Perform a few row operations, each expressed in terms of the rows of the original matrix, to make it as triangular as possible:
- L1 = L1
- L2 = L2 - L1
- L3 = L3 + L2 - L1
- L4 = L4 + L1
- L5 = L5 + L4
\[\begin{vmatrix}2&0&1&2&0\\2&-1&0&1&1\\0&1&2&1&2\\-2&0&2&-1&2\\2&0&0&1&1\end{vmatrix} \leadsto\begin{vmatrix} 2&0&1&2&0\\ 0&-1&-1&-1&1\\ 0&0&1&0&3\\ 0&0&3&1&2\\ 0&0&2&0&3 \end{vmatrix}\]
Then Laplace expansion down the first column (twice) leaves us with just: \[2\times (-1)\times \begin{vmatrix}1&0&3\\3&1&2\\2&0&3\end{vmatrix} = 2\times(-1)\times(-3) = 6\]
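A quick numerical check of this value with numpy:
import numpy as np

A = np.array([[ 2,  0, 1,  2, 0],
              [ 2, -1, 0,  1, 1],
              [ 0,  1, 2,  1, 2],
              [-2,  0, 2, -1, 2],
              [ 2,  0, 0,  1, 1]])
print(np.linalg.det(A))  # approximately 6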
Q11
\[\large\begin{align*} \sigma (z) &= \frac{1}{1+e^{-z}}\\ \sigma '(z) &= (1+e^{-z})^{-2} e^{-z}\\ &= \frac{e^{-z}}{(1+e^{-z})^2}\\ &= \frac{1}{1+e^{-z}} \times \frac{1+e^{-z}-1}{1+e^{-z}}\\ \sigma '(z) &= \sigma(z)(1-\sigma(z))\quad \small\square\end{align*}\]
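A finite-difference sanity check of this identity (a sketch; the test point and step size are arbitrary):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # the two values agree to many decimal places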
Q12
a)
A single yes/no experiment, such as a coin toss.
\[P(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0\end{cases}\] where \( p \) is the probability of success, and \( X \in \{0, 1\} \).
You only have a single trial, and it is either successful or not.
b)
A series of \(n\) independent Bernoulli (yes/no) experiments, such as counting the number of heads when you flip a coin \(n\) times.
\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\] where:
- \( n \): number of trials,
- \( k \): number of successes,
- \( p \): probability of success in each trial,
- \( \binom{n}{k} = \frac{n!}{k!(n-k)!} \).
c)
A continuous distribution with a bell-shaped curve. Its PDF is governed by:
\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\] where:
- \( \mu \): mean (center of the distribution),
- \( \sigma^2 \): variance (spread of the distribution).
An example of a quantity that approximately follows this distribution is the height of people in the world.
d)
This distribution models the 80/20 phenomenon. An example would be the distribution of wealth (obviously). Its PDF is given by: \[f(x) = \frac{\alpha x_m^\alpha}{x^{\alpha+1}} \quad \text{for } x \geq x_m\] where:
- \( \alpha > 0 \): shape parameter,
- \( x_m > 0 \): scale parameter (minimum value).
e)
The geometric distribution models the number of trials required to obtain the first success in a sequence of independent Bernoulli trials, e.g. rolling a die until you finally see a 6.
\[P(X = k) = (1-p)^{k-1}p\]
- \( p \): probability of success in each trial,
- \( k \): trial number of the first success (\( k = 1, 2, 3, \dots \)).
f)
This distribution models the number of events that happen in a fixed interval of time or space, e.g. the number of phone calls a service company receives in a 2 hour block.
\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\] where:
- \( \lambda > 0 \): average rate of events in a fixed interval,
- \( k \): number of events (\( k = 0, 1, 2, \dots \)).
g)
This represents experiments in which all outcomes are equally likely to occur, like flipping heads on a fair coin or rolling a 6 on a fair die.
We have both a discrete and continuous case:
Discrete:
\[P(X=x)=\cfrac{1}{n}\quad\text{for }x\in \{x_1, x_2,\ldots,x_n\}\]
Continuous: \[f(x) = \begin{cases} \cfrac{1}{b-a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}\] where:
- \( a \): lower bound,
- \( b \): upper bound.
Q13
\(n\choose r\) counts the ways of choosing \(r\) objects from a set of \(n\) distinct objects.
Note that \(r\) is of the same type as \(n\): if \(n\) is 5 shoes, \(r\) could be 1, 2, 3, 4 or 5 shoes.
Choosing the \(r\) objects one at a time grants you 5 options for the first shoe, then 4, and so on; in general there are \(n\times(n-1)\times\ldots\times(n-r+1)\) such ordered selections.
However, since the \(r\) selected items can be ordered \(r!\) ways without changing which set of shoes you end up with, we divide through by \(r!\).
\[{n\choose r} = \cfrac{n\times(n-1)\times\ldots\times(n-r+1)}{r!}\] which can then be massaged by the clever trick of multiplying by \(\cfrac{(n-r)!}{(n-r)!}\), allowing the falling product in the numerator to continue beyond \((n-r+1)\), namely \(n\times(n-1)\times\ldots\times(n-r+1)\times(n-r)\times(n-r-1)\times\ldots\times 1\), which ultimately equals \(n!\).
The denominator then absorbs the extra factor to form \(r!(n-r)!\), such that our final closed form for \(n\choose r\) is
\[\frac{n!}{r!(n-r)!}\square\]
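A tiny check of the closed form against Python's built-in binomial coefficient (with arbitrary example values of \(n\) and \(r\)):
import math

n, r = 5, 2  # arbitrary example values
closed_form = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
print(closed_form, math.comb(n, r))  # 10 10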
Q14
a)
\[\Large\begin{align*} \frac{\partial}{\partial{w_0}} \mathcal{L}(\vec{w}) &= \frac{\partial}{\partial{w_0}}~ \frac{1}{n}\sum(y_i-w_0-w_1x_i)^2 \\ 0&\stackrel{\text{(set)}}{=} -\frac{2}{n}\sum(y_i-w_0-w_1x_i) \\ &= -2\left[ \frac{1}{n}\sum y_i - \frac{1}{n}\sum w_0 - \frac{1}{n}\sum w_1 x_i \right ] \\ &= -2 \left[ \bar{y} - w_0 - w_1 \bar{x} \right ]\\ \therefore w_0 &= \bar{y}-w_1\bar{x}\end{align*}\]
b)
\[\Large\begin{align*} \frac{\partial}{\partial{w_1}} \mathcal{L}(\vec{w}) &= \frac{\partial}{\partial{w_1}}~ \frac{1}{n}\sum(y_i-w_0-w_1x_i)^2 \\ 0&\stackrel{\text{(set)}}{=} -\frac{2}{n}\sum(y_i-w_0-w_1x_i)x_i\\ &= -2\left[ \frac{1}{n}\sum y_ix_i -\frac{1}{n}\sum w_0x_i -\frac{1}{n}\sum w_1x_i^2 \right ] \\ &= \overline{xy} -w_0 \overline{x} -w_1 \overline{x^2} \\ \therefore w_1 &= \frac{\overline{xy}-w_0\overline{x}}{\overline{x^2}} \end{align*}\]
c)
Substitute \(w_0\) into \(w_1\): \[\Large\begin{align*} w_1 &= \frac{\overline{xy}-(\overline{y}-w_1\overline{x})\overline{x}}{\overline{x^2}} \\ &= \frac{\overline{xy}-\bar{x}\bar{y}+w_1\overline{x}^2}{\overline{x^2}}\\ \implies w_1\overline{x^2} - w_1\overline{x}^2 &= \overline{xy} - \bar{x}\bar{y} \\ w_1(\overline{x^2}-\overline{x}^2) &= \overline{xy} - \bar{x}\bar{y} \\ w_1 &= \frac{\overline{xy}-\bar{x}\bar{y}}{\overline{x^2}-\overline{x}^2} \\ \implies w_0 &= \overline{y} - \left ( \frac{\overline{xy}-\bar{x}\bar{y}}{\overline{x^2}-\overline{x}^2}\overline{x} \right ) \\ \text{and thus }\quad \hat{y}(x) &= \overline{y} - \left ( \frac{\overline{xy}-\bar{x}\bar{y}}{\overline{x^2}-\overline{x}^2}\overline{x} \right ) + \frac{\overline{xy}-\bar{x}\bar{y}}{\overline{x^2}-\overline{x}^2}x \end{align*}\]
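A quick numerical check of these closed-form expressions on a small made-up dataset, compared against numpy's degree-1 polynomial fit:
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # illustrative data only
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)
print(np.polyfit(x, y, 1))  # returns [slope, intercept]; matches [w1, w0]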
Q15
a)
\(\large\text{RTS: } \nabla_{\vec{x}} \vec{b}^T \vec{x} = \vec{b}\).
\[\large\begin{align*} \vec{b}^T \vec{x} &= 1\left\{\vphantom{\begin{bmatrix} b_1 & b_2 & \ldots & b_n \end{bmatrix}}\right. \underbrace{\begin{bmatrix} b_1 & b_2 & \ldots & b_n \end{bmatrix}}_{n} \overbrace{\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}}^{1} \left.\vphantom{\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}}\right\}n\\ &= b_1 x_1 + \ldots + b_n x_n \\ \implies \frac{\partial}{\partial x_i} \vec{b}^T \vec{x} &= b_i \\ \therefore \quad \nabla_{\vec{x}} \vec{b}^T \vec{x} &= \vec{b} \, \square \end{align*}\]
b)
\(\large\text{RTS: } \nabla_{\vec{x}} \vec{x}^T\vec{A}\vec{x} = 2\vec{A}\vec{x}\) (for symmetric \(\vec{A}\)).
\[\large\begin{align*} \vec{x}^T \vec{A}\vec{x} &= \begin{bmatrix}x_1&x_2&\ldots &x_n\end{bmatrix} \begin{bmatrix}A_{1,1}&\dots &A_{1,n}\\A_{2,1}&\dots &A_{2,n}\\\vdots &\ddots &\vdots\\A_{n,1}&\dots &A_{n,n}\end{bmatrix} \begin{bmatrix}x_1\\x_2\\\vdots \\x_n\end{bmatrix} \\ &= \begin{bmatrix}A_{1,1}x_1+A_{2,1}x_2+\ldots+A_{n,1}x_n & A_{1,2}x_1+A_{2,2}x_2+\ldots +A_{n,2}x_n & \ldots & A_{1,n}x_1 +\ldots+ A_{n,n}x_n\end{bmatrix} \begin{bmatrix}x_1\\x_2\\\vdots \\x_n\end{bmatrix} \\ &= x_1[A_{1,1}x_1+A_{2,1}x_2+\ldots+A_{n,1}x_n] + x_2[A_{1,2}x_1+A_{2,2}x_2+\ldots +A_{n,2}x_n] + \ldots + x_n [A_{1,n}x_1 +\ldots+ A_{n,n}x_n] \\ &= [A_{1,1}x_1^2+A_{2,1}x_1x_2+\ldots+A_{n,1}x_1x_n] + [A_{1,2}x_1x_2+A_{2,2}x_2^2+\ldots +A_{n,2}x_2x_n] + \ldots + [A_{1,n}x_1x_n +\ldots+ A_{n,n}x_n^2] \\ \implies \frac{\partial}{\partial{x_i}} \vec{x}^T\vec{A}\vec{x} &= 2A_{i,i}x_i + \sum_{j\neq i} A_{j,i}x_j + \sum_{j\neq i} A_{i,j}x_j \\ &= 2\sum_{j=1}^n A_{i,j}x_j\quad\text{by the symmetry of }\vec{A} \\ \text{Thus, }\nabla_{\vec{x}} \vec{x}^T\vec{A}\vec{x} &= 2\vec{A}\vec{x}\quad\square. \end{align*}\]
c)
\[\large\begin{align*} \mathcal{L}(\vec{w}) &= \frac{1}{n} \|\vec{y}-\vec{X}\vec{w}\|_2^2 \\ &= \frac{1}{n} (\vec{y}-\vec{X}\vec{w})^T(\vec{y}-\vec{X}\vec{w}) \\ &= \frac{1}{n} \left( \vec{y}^T\vec{y} - \vec{y}^T\vec{X}\vec{w} - \vec{w}^T\vec{X}^T\vec{y} + \vec{w}^T\vec{X}^T\vec{X}\vec{w} \right) \\ &= \frac{1}{n} \left( \vec{y}^T\vec{y} - 2\vec{y}^T\vec{X}\vec{w} + \vec{w}^T\vec{X}^T\vec{X}\vec{w} \right) \quad \text{because both } \vec{y}^T\vec{X}\vec{w} \text{ and } \vec{w}^T\vec{X}^T\vec{y} \text{ are scalars} \\ \text{then } \nabla_{\vec{w}} \frac{1}{n} \left( \vec{y}^T\vec{y} - 2\vec{y}^T\vec{X}\vec{w} + \vec{w}^T\vec{X}^T\vec{X}\vec{w} \right) &= \nabla_{\vec{w}} \frac{1}{n} \left( \vec{y}^T\vec{y} - 2(\vec{X}^T\vec{y})^T\vec{w} + \vec{w}^T(\vec{X}^T\vec{X})\vec{w} \right) \\ \text{and } 0 &\stackrel{\text{(set)}}{=} -2\vec{X}^T\vec{y} + 2\vec{X}^T\vec{X}\vec{w} \quad \text{(by applying the derivative rules of parts a, b)} \\ \text{such that, } \vec{w} &= (\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y} \end{align*}\]
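A sketch verifying this closed form numerically against numpy's least-squares solver (the design matrix and targets are made up for illustration):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # illustrative design matrix
y = rng.normal(size=20)

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))  # True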
Q16
From the previous part we have \(\nabla_{\vec{w}}\mathcal{L}(\vec{w}) = -2\vec{X}^T\vec{y} + 2\vec{X}^T\vec{X}\vec{w}\) (dropping the constant \(\frac{1}{n}\) factor).
\[\begin{align*} \nabla^2_{\vec{w}}\mathcal{L}(\vec{w}) &= 2\vec{X}^T\vec{X} \\ &= 2(\vec{X}^T\vec{X})^T\quad\text{thus proving the first condition (symmetry) of a positive semidefinite matrix.} \\ \text{Then for any arbitrary }\vec{z}\text{ of matching dimension:}\\ \vec{z}^T2\vec{X}^T\vec{X}\vec{z} &= 2(\vec{X}\vec{z})^T\vec{X}\vec{z} \\ &= 2 \|\vec{X}\vec{z}\|^2_2 \\ &\geq 0. \end{align*}\] Thus proving the second condition for PSD-ness, and confirming that our critical point from 15c) was a minimum.
Q17
Recall the update rule for our gradient function from Question 1b: \[x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}\]
Thus \[\large\begin{align*} \vec{w}_1 &= \vec{w}_0 - \frac{f'(\vec{w}_0)}{f''(\vec{w}_0)} \\ &= \vec{w} - \frac{\nabla_{\vec{w}}\mathcal{L}(\vec{w})}{\nabla^2_{\vec{w}}\mathcal{L}(\vec{w})} \\ &= \vec{w} - \frac{-2\vec{X}^T\vec{y} + 2\vec{X}^T\vec{X}\vec{w}}{2\vec{X}^T\vec{X}} \\ &= \rlap{\(\vec{w}\)}\textcolor{red}{\cancel{\phantom{\vec{w}}}} + \frac{\vec{X}^T\vec{y}}{\vec{X}^T\vec{X}}- \rlap{\(\vec{w}\)}\textcolor{red}{\cancel{\phantom{\vec{w}}}} \\ &= (\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y}\quad{\small\square}. \end{align*}\]
Thus it takes only a single iteration of the Newton-Raphson method to converge to the minimum of the least squares loss function!
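A small numerical sketch of this fact (again with made-up data): starting from \(\vec{w}_0 = \vec{0}\), a single Newton step lands exactly on the closed-form solution.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))   # illustrative data
y = rng.normal(size=30)

w0 = np.zeros(4)
grad = -2 * X.T @ y + 2 * X.T @ X @ w0
hess = 2 * X.T @ X
w1 = w0 - np.linalg.solve(hess, grad)        # one Newton step

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y  # closed-form least squares
print(np.allclose(w1, w_closed))             # True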
Q18
a)
In this particular case, the parameter we wish to estimate is \(p = \theta\) where \(\theta \in [0,1]\). Also, \[\begin{align*} \mathcal{L}(\theta) &= \text{prob of observing all }X_i\text{ under the probability distribution }p_\theta\\ &= \text{prob}(X_1)\times \text{prob}(X_2)\times \dots \times \text{prob}(X_n)\quad\text{under the assumption that all }X_i\text{ are i.i.d.}\\ &= \prod_{i=1}^n p_\theta(X_i)\end{align*}\]
So we have \[\begin{align*} \operatorname*{arg\,max}_{p} \mathcal{L}(p) &=\operatorname*{arg\,max}_{p}\left[ \prod p^{X_i}(1-p)^{1-X_i}\right]\\ &= \operatorname*{arg\,max}_{p} \left[\log\left(\prod p^{X_i}(1-p)^{1-X_i} \right)\right]\quad\text{since a monotonically increasing function (log) affects the max, but never the argmax, of the original function}\\ &= \operatorname*{arg\,max}_{p} \left[\sum \log(p^{X_i}(1-p)^{1-X_i} )\right] \\ &= \operatorname*{arg\,max}_{p} \left[\sum X_i\log(p) + (1-X_i)\log(1-p)\right]\\ &= \operatorname*{arg\,max}_{p} \left[n\bar{X}\log(p) + n(1-\bar{X})\log(1-p)\right]\\ 0 &\stackrel{\text{(set)}}{=} \frac{\partial}{\partial{p}} \left[n\bar{X}\log(p) + n(1-\bar{X})\log(1-p)\right] \\ &= \frac{n\bar{X}}{p}-\frac{n(1-\bar{X})}{1-p} \\ &= \frac{(1-p)n\bar{X} - n(1-\bar{X})p}{p(1-p)} \\ 0 &= n\bar{X} - \rlap{\(pn\bar{X}\)}\textcolor{red}{\cancel{\phantom{pn\bar{X}}}} -np + \rlap{\(n\bar{X}p\)}\textcolor{red}{\cancel{\phantom{n\bar{X}p}}} \\ &= n(\bar{X}-p) \\ \implies p &= \bar{X} \end{align*}\]
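A tiny simulation supporting this result (a sketch; the true \(p\) and sample size are arbitrary choices):
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.3                              # arbitrary true parameter
X = rng.binomial(1, p_true, size=10_000)  # i.i.d. Bernoulli samples
print(X.mean())                           # the MLE; close to 0.3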
b)
Given that \[\mathcal{N}(\mu,\sigma^2) = \cfrac{1}{\sqrt{2\pi}\sigma}e^{-\cfrac{(x-\mu)^2}{2\sigma^2}}\] we can apply the same logic as in part a to construct our argmax formulation: \[\large\begin{align*} \mathcal{L}(\mu,\sigma^2) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \\ \implies\operatorname*{arg\,max}_{\mu} \mathcal{L}(\mu,\sigma^2) &= \operatorname*{arg\,max}_{\mu} \left[\prod \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \right]\\ &= \operatorname*{arg\,max}_{\mu} \left[\sum\log\left(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}\right) \right]\\ &= \operatorname*{arg\,max}_{\mu} \left[\sum\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right)+\log\left(e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}\right) \right]\\ &= \operatorname*{arg\,max}_{\mu} \left[\sum\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right)-\frac{(X_i-\mu)^2}{2\sigma^2} \right]\quad(1)\\ 0 &\stackrel{\text{(set)}}{=} \frac{\partial}{\partial \mu} \left[ \sum \frac{\rlap{\(2\)}\textcolor{red}{\cancel{\phantom{2}}}(X_i - \mu)}{\rlap{\(2\)}\textcolor{red}{\cancel{\phantom{2}}}\sigma^2} \right] \\ &= \sum \frac{X_i}{\sigma^2} - \frac{\mu}{\sigma^2} \\ \implies 0 &= \frac{n\bar{X}-n\mu}{\sigma^2} \\ \therefore \mu_{\mathrm{MLE}} &= \bar{X} \end{align*}\]
Then to solve for \(\sigma^2_{\mathrm{MLE}}\) we can continue to expand the expression in (1), this time taking the derivative w.r.t. \(\sigma^2\). \[\large\begin{align*} \operatorname*{arg\,max}_{\sigma^2} \left[\sum\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right)-\frac{(X_i-\mu)^2}{2\sigma^2} \right] &=\operatorname*{arg\,max}_{\sigma^2} \left[\sum \log(1) - \frac{1}{2}\log(2\pi\sigma^2) -\frac{(X_i-\mu)^2}{2\sigma^2}\right] \\ &= \operatorname*{arg\,max}_{\sigma^2} \left[\sum -\frac{1}{2} \left( \log(2\pi) + \log(\sigma^2) \right) -\frac{(X_i-\mu)^2}{2\sigma^2}\right] \\ 0 &\stackrel{\text{(set)}}{=} \frac{\partial}{\partial \sigma^2} \sum -\frac{1}{2} \left( \log(2\pi) + \log(\sigma^2) \right) -\frac{(X_i-\mu)^2}{2\sigma^2} \\ &= \sum -\frac{1}{2\sigma^2} + \frac{(X_i -\mu)^2}{2\sigma^4} \\ &= -\frac{n}{2\sigma^2} + \frac{\sum(X_i-\mu)^2}{2\sigma^4} \\ &= \frac{-\sigma^2 n + \sum(X_i - \mu)^2}{2\sigma^4} \\ \implies \sigma^2_\mathrm{MLE} &= \frac{\sum(X_i-\mu)^2}{n} \\ &= \frac{1}{n} \sum(X_i -\bar{X})^2\quad(\text{using }\mu_\mathrm{MLE} = \bar{X}\text{ from above}) \end{align*}\]
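A quick numerical check of both estimators (a sketch; the true parameters are arbitrary):
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=2.0, scale=3.0, size=100_000)  # arbitrary mu = 2, sigma = 3

mu_mle = X.mean()
var_mle = np.mean((X - mu_mle) ** 2)   # equivalently np.var(X, ddof=0)
print(mu_mle, var_mle)                 # close to 2 and 9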
Q19
MLE for Linear Regression with Normally Distributed Errors:
\[\large\begin{align*} y^{(i)} &= {\vec{x}^{(i)}}^T\vec{w} + \epsilon^{(i)},\quad\epsilon^{(i)}\sim\mathcal{N}(0,\sigma^2) \\ \implies & y^{(i)} \mid \vec{x}^{(i)}; \vec{w} \sim \mathcal{N}({\vec{x}^{(i)}}^T\vec{w}, \sigma^2) \\ \iff & p(y^{(i)} \mid \vec{x}^{(i)}; \vec{w}) = \cfrac{1}{\sqrt{2\pi}\sigma}e^{-\cfrac{(y^{(i)}-{\vec{x}^{(i)}}^T\vec{w})^2}{2\sigma^2}} \\ \mathcal{L}(\vec{w}) &= \prod p(y^{(i)} \mid \vec{x}^{(i)}; \vec{w}) \\ \log\mathcal{L}(\vec{w}) &= \log\left(\prod p(y^{(i)} \mid \vec{x}^{(i)}; \vec{w})\right)\quad\text{(henceforth the log likelihood will be denoted \(\mathcal{L}\mathcal{L}\))} \\ \text{Consequently }\vec{w}_\mathrm{MLE} &= \operatorname*{arg\,max}_{\vec{w}} \left[\mathcal{L}\mathcal{L}(\vec{w})\right] \\ &= \operatorname*{arg\,max}_{\vec{w}} \left[\sum -\frac{1}{2}\log(2\pi\sigma^2) -\frac{(y^{(i)}-{\vec{x}^{(i)}}^T\vec{w})^2}{2\sigma^2}\right] \\ &= \operatorname*{arg\,min}_{\vec{w}} \left[\frac{\|\vec{y}-\vec{X}\vec{w}\|^2_2}{2\sigma^2}\right]\quad\text{(the first term does not depend on }\vec{w}\text{)} \\ &= (\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y} \\ &= \vec{w}_\mathrm{MSE}\quad\text{coincidence?} \end{align*}\]
Q20
a)
A parametric model learns a set of tunable knobs (parameters) from the data; these knobs are then carried around for inference. In contrast, a non-parametric model performs inference by direct comparison against the data. The benefit is that this type of modelling requires minimal training; the caveat is that you must carry around all of the data for inference!
b)
We minimize the weighted least squares objective: \[\mathcal{L}(\vec{w}) = \sum_{i=1}^n w^{(i)}\left(y^{(i)} - {\vec{x}^{(i)}}^T\vec{w}\right)^2.\]
Expanding in matrix form, with \(W = \mathrm{diag}(w^{(1)},\ldots,w^{(n)})\): \[\mathcal{L}(\vec{w}) = \vec{y}^T W \vec{y} - 2 \vec{y}^T W X \vec{w} + \vec{w}^T X^T W X \vec{w}.\]
Set the gradient to 0: \[-2X^T W \vec{y} + 2X^T W X \vec{w} = 0 \implies \vec{w} = \left(X^T W X\right)^{-1} X^T W \vec{y}.\]
c)
Computing \(\hat{y}\) at \(x = 0.5\):
Given data: \[\{(x^{(i)}, y^{(i)})\} = \{(-1, 1), (0, 0), (1, 1)\}, \quad \tau = 1, \quad \vec{x} = [1, x].\]
- Weights: \[\begin{align*} w^{(i)}&= \exp\left(-\frac{(x^{(i)} - 0.5)^2}{2(1)^2}\right)\\ \text{such that }w^{(1)} &= \exp\left(-\frac{(-1 - 0.5)^2}{2}\right) = \exp(-1.125) \\ w^{(2)} &= \exp\left(-\frac{(0 - 0.5)^2}{2}\right) = \exp(-0.125) \\ w^{(3)} &= \exp\left(-\frac{(1 - 0.5)^2}{2}\right) = \exp(-0.125) \\ \text{thus, }W &= \text{diag}([w^{(1)}, w^{(2)}, w^{(3)}]) \end{align*}\]
- Matrices: \[X = \begin{bmatrix}1 & -1 \\1 & 0 \\1 & 1\end{bmatrix} , \quad \vec{y} = \begin{bmatrix}1 \\0 \\1\end{bmatrix}.\]
- Weighted least squares solution: \[\vec{w} = \left(X^T W X\right)^{-1} X^T W \vec{y} = \begin{bmatrix}0.518\\0.223\end{bmatrix} \]
- Predicted value: \[\hat{y} = \vec{x} \cdot \vec{w} \approx 0.63{\scriptsize\square}\]
import numpy as np
from numpy.linalg import inv

# design matrix with a bias column, targets, and the Gaussian weights from above
X = np.array([[1, -1], [1, 0], [1, 1]])
y = np.array([1, 0, 1])
W = np.diag(np.array([np.exp(-1.125), np.exp(-0.125), np.exp(-0.125)]))

# weighted least squares solution
w = inv(X.T @ W @ X) @ X.T @ W @ y
print(w)

# prediction at the query point x = 0.5
x_query = np.array([1, 0.5])
print(w @ x_query)
[0.51825006 0.22262491]
0.6295625143296497
Q21
A hyperplane.
Q22
The points had initially been coloured at random, but there is clearly a circular pattern to the data, so constructing a linear decision boundary in the original Euclidean space would hardly be a good idea.
Instead, we can apply a second-degree polynomial kernel (feature map) and then follow that up with a k-means clustering algorithm to separate the two rings (see the sketch below).
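A minimal sketch of this pipeline, assuming synthetic two-ring data generated with scikit-learn; for simplicity it lifts the data with a single degree-2 polynomial feature (the squared radius) rather than the full kernel machinery:
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans

# synthetic stand-in for the plotted data: two concentric rings
X, _ = make_circles(n_samples=500, noise=0.05, factor=0.4, random_state=0)

# one useful degree-2 polynomial feature: the squared radius x1^2 + x2^2
r2 = (X ** 2).sum(axis=1).reshape(-1, 1)

# k-means in the lifted feature space now separates the two rings
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(r2)
print(np.bincount(labels))  # roughly 250 points per cluster, one per ring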
Q23
\[\begin{align*} \vec{w}_\mathrm{ridge} &= \operatorname*{arg\,min}_{\vec{w}} \left[\|\vec{y}-\vec{X}\vec{w}\|^2_2 + \lambda\|\vec{w}\|^2_2\right] \\ &= \operatorname*{arg\,min}_{\vec{w}}\left[ (\vec{y}-\vec{X}\vec{w})^T(\vec{y}-\vec{X}\vec{w}) + \lambda\vec{w}^T\vec{w}\right] \\ \frac{\partial}{\partial{\vec{w}}}\left[ \vec{y}^T\vec{y} -2(\vec{X}^T\vec{y})^T\vec{w} + \vec{w}^T\vec{X}^T\vec{X}\vec{w} + \lambda\vec{w}^T\vec{w}\right] &= -2\vec{X}^T\vec{y} + 2\vec{X}^T\vec{X}\vec{w} + 2\lambda\vec{w} \\ &\stackrel{\text{(set)}}{=} 0 \\ \implies \vec{w}_\mathrm{ridge} &= (\vec{X}^T\vec{X}+\lambda \vec{I})^{-1}\vec{X}^T\vec{y} \end{align*}\]
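A sketch of this closed form in numpy, checked against solving the regularized normal equations directly (the data and \(\lambda\) are arbitrary):
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))   # illustrative data
y = rng.normal(size=50)
lam = 0.1                      # arbitrary regularization strength

w_inv = np.linalg.inv(X.T @ X + lam * np.eye(4)) @ X.T @ y
w_solve = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(w_inv, w_solve))  # True
In practice np.linalg.solve is preferred over forming the inverse explicitly.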