Mathematics for Machine Learning
Linear Algebra
We have to check the five properties listed in Definition 2.7. Firstly, let's look at closure. Our operation should take two elements of $\mathbb{R}\setminus\{-1\}$ and return another element of $\mathbb{R}\setminus\{-1\}$. The result is certainly a real number, so suppose, for a contradiction, that it can equal $-1$. Let $a,b \in \mathbb{R}\setminus\{-1\}$ be such that $-1 = ab + a+ b$. Then $ab+a+b+1 = 0$, so $(a+1)(b+1) = 0$. Therefore, either $a=-1$ or $b=-1$, which is a contradiction.
Now, we attack associativity. We use the fact that $+$ is associative and commutative on $\mathbb{R}$. We have
\begin{aligned} (a\star b) \star c & = (ab+a+b) \star c \\ & = (ab+a+b)c + (ab+a+b) + c \\ & = abc + ac + bc + ab + a + b + c \\ & = abc + ab + ac + a + bc + b + c \\ & = a(bc+b+c) + a + (bc+b+c) \\ & = a(b\star c)+a + (b\star c) \\ & = a\star (b\star c). \end{aligned}
Thirdly, observe that $a\star 0 = a \cdot 0 +a+0 = a$, and similarly $0\star a = a$, so $0$ is the neutral element.
Fourthly, we take $x \star y = 0$ (the neutral element we found in the previous step). Solving for $y$ in terms of $x$, we find that $ y = \frac{-x}{x+1}, $ which always exists, since $x\neq -1$. Thus such a $y$ is the inverse element for $x \in \mathbb{R}\setminus\{-1\}$. This concludes the argument that $(\mathbb{R}\setminus\{-1\},\star)$ is a group.
Finally, if we observe that $a\star b = b \star a$, by the commutativity of multiplication and addition of real numbers, then we can conclude that $(\mathbb{R}\setminus\{-1\},\star)$ is an \emph{abelian} group.
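As a quick sanity check (not part of the formal proof), here is a small Python sketch that spot-checks the group axioms for $\star$ at random points; the helper name `star` is ours.

```python
# A numerical spot-check of the group axioms for (R \ {-1}, *); a sketch only,
# it samples random elements rather than proving anything.
import random

def star(a, b):
    return a * b + a + b

random.seed(0)
for _ in range(1000):
    a, b, c = (random.uniform(-10, 10) for _ in range(3))
    # closure rests on the factorisation (a * b) + 1 = (a + 1)(b + 1)
    assert abs((star(a, b) + 1) - (a + 1) * (b + 1)) < 1e-6
    # associativity and commutativity
    assert abs(star(star(a, b), c) - star(a, star(b, c))) < 1e-6
    assert abs(star(a, b) - star(b, a)) < 1e-9
    # neutral element 0 and inverse -a/(a+1)
    assert abs(star(a, 0) - a) < 1e-9
    if abs(a + 1) > 1e-6:
        assert abs(star(a, -a / (a + 1))) < 1e-6
print("all sampled checks passed")
```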
The major thing we need to check here is whether the operation $\oplus$ is well-defined. That is to say, if we take two different representatives of the same congruence classes, do we definitely get the same answer at the end? Let's take $a_1, a_2 \in \mathbb{Z} $ such that $\overline{a_1} = \overline{a_2}$, and similarly let $b_1, b_2 \in \mathbb{Z} $ such that $\overline{b_1} = \overline{b_2}$. We need to show that $\overline{a_1}\oplus \overline{b_1} = \overline{a_2} \oplus \overline{b_2}$. In other words, we need to show that $\overline{a_1+b_1} = \overline{a_2+b_2}$.
By the definition of the congruence class, two classes $\overline{a_1}$ and $\overline{a_2}$ are equal in $\mathbb{Z}_n$ if and only if $a_1 - a_2$ is a multiple of $n$. Let's say $a_1 - a_2 = k_1 n$ and ${b_1 - b_2 = k_2 n}$, for some $k_1, k_2 \in \mathbb{Z}$. Observe then that $(a_1+b_1)-(a_2+b_2) = (k_1+k_2)n$, and so $\overline{a_1+b_1} = \overline{a_2+b_2}$ indeed!
From here, we use properties of addition in the integers to deduce that $(\mathbb{Z}_n,\oplus)$ is an abelian group.
$\overline{a}\oplus\overline{b} = \overline{a+b}\in \mathbb{Z}_n$
$(\overline{a}\oplus \overline{b}) \oplus \overline{c} = \overline{a+b} \oplus \overline{c} = \overline{(a+b)+c} = \overline{a+(b+c)} = \overline{a} \oplus \overline{b+c} = \overline{a}\oplus (\overline{b} \oplus \overline{c})$, where the middle equality follows from the associativity of $\mathbb{Z}$.
$\overline{0}$ is the neutral element, since $\overline{0} \oplus \overline{a} = \overline{0+a} = \overline{a}$, and similarly the other way round.
Given $\overline{a} \in \mathbb{Z}_n$, we have that $\overline{-a} = \overline {n-a}$ is the inverse element, since $\overline{a} \oplus \overline{-a} = \overline{a-a} = \overline{0}$, indeed.
Finally, $\overline{a}\oplus \overline{b} = \overline{a+b} = \overline{b+a} = \overline{b} \oplus \overline{a}$, so the group is abelian!
If $n$ is prime, then every integer from $1$ to $n-1$ will be relatively prime to $n$. Let $a$ be such an integer. Then, by B\'ezout's theorem, there exist integers $u, v$ such that $au+nv=1$. If we take this identity mod $n$, we have $\overline{au} \oplus \overline{nv} = \overline{1}$. But $nv$ is a multiple of $n$, so $\overline{nv} = \overline{0}$, and this simplifies to $\overline{a} \otimes \overline{u} = \overline{1}$ – in other words, $\overline{u}$ is the inverse of $\overline{a}$. Also, if we take two integers $a$ and $b$, both of which are relatively prime to $n$, then the product $ab$ also cannot contain a factor of $n$, since $n$ is prime, so the operation is closed. The other properties follow immediately. This shows that we have a group. $\lozenge$
If $n$ is not prime, then we can say $n=ab$, where $a,b\in \mathbb{Z}$, with $1<a\leq b<n$. But then $\overline{a}\otimes \overline{b} = \overline{ab} = \overline{n} = \overline{0}\notin \mathbb{Z}_n\setminus\{\overline{0}\}$, and so the operation is not closed.
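A small Python sketch of this dichotomy (using the built-in modular inverse `pow(a, -1, n)`, available from Python 3.8):

```python
# Sketch: for prime n every nonzero class has a multiplicative inverse;
# for composite n the operation on Z_n \ {0} is not even closed.

def has_all_inverses(n):
    """Return True if every a in {1, ..., n-1} is invertible modulo n."""
    for a in range(1, n):
        try:
            u = pow(a, -1, n)          # raises ValueError when gcd(a, n) != 1
            assert (a * u) % n == 1
        except ValueError:
            return False
    return True

print(has_all_inverses(7))    # True: 7 is prime
print(has_all_inverses(12))   # False: e.g. 3 * 4 = 12 ≡ 0 (mod 12)
```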
Let $A_1 = \begin{bmatrix} 1 & x_1 & y_1 \\ 0 & 1 & z_1 \\ 0 & 0 & 1 \end{bmatrix}$ and $A_2 = \begin{bmatrix} 1 & x_2 & y_2 \\ 0 & 1 & z_2 \\ 0 & 0 & 1 \end{bmatrix}\in \mathcal{G}$.
Then $A_1 A_2 =\begin{bmatrix} 1 & x_1+x_2 & y_1+x_1z_2+y_2 \\ 0 & 1 & z_1+z_2 \\ 0 & 0 & 1 \end{bmatrix}\in\mathcal{G}$, so we have closure.
Associativity follows from the associativity of standard matrix multiplication.
Letting $x=y=z=0$, observe that the identity is in $\mathcal{G}$.
Finally, if we take $x_2 = -x_1$, $z_2 = -z_1$, and $y_2 = -y_1-x_1z_2$, then observe that $A_1A_2 = I_3$, and thus inverses are of the required form! Therefore, $\mathcal{G} $ is a group.
The group is not abelian, e.g. take $x_1=z_2=1$ and everything else to be $0$. Then multiplying these matrices in the other order (i.e. $x_2=z_1=1$) gives a different answer.
Let's do some Gaussian elimination. We start with $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 2&5&-7&-5&-2\\ 2&-1&1&3&4\\ 5&2&-4&2&6 \end{array}\right]$$.
Taking $r_2-2r_1$, $r_3-2r_1$ and $r_4-5r_1$, we have $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&-3&3&5&2\\ 0&-3&1&7&1 \end{array}\right]$$.
Now taking $r_3+r_2$ and $r_4+r_2$, we have $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&0&-2&2&-2\\ 0&0&-4&4&-3 \end{array}\right]$$.
Finally, taking $r_4-2r_3$, we have $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&0&-2&2&-2\\ 0&0&0&0&1 \end{array}\right]$$.
Observe that the rank of the augmented matrix (4) is greater than the rank of the coefficient matrix (3), so the system has no solution.
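As a sketch (assuming NumPy), the rank argument can be confirmed numerically:

```python
import numpy as np

# coefficient matrix and right-hand side of the system above
A = np.array([[1, 1, -1, -1],
              [2, 5, -7, -5],
              [2, -1, 1, 3],
              [5, 2, -4, 2]], dtype=float)
b = np.array([1, -2, 4, 6], dtype=float)

print(np.linalg.matrix_rank(A))                        # 3
print(np.linalg.matrix_rank(np.column_stack([A, b])))  # 4 -> no solution
```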
We proceed with some more Gaussian elimination. From the start, we do $r_2-r_1$, $r_3-2r_1$, and $r_4+r_1$, to obtain $$\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&2&0&-3&-1&3\\ 0&1&0&1&-3&-1\\ 0&1&0&-2&0&2 \end{array}\right]$$.
From here, do $r_2-2r_3$ and $r_4-r_3$ to obtain $$\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&0&0&-5&5&5\\ 0&1&0&1&-3&-1\\ 0&0&0&-3&3&3 \end{array}\right]$$.
Next, if we divide $r_2$ by $-5$, and then do $r_4+3r_2$, we have $$\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&0&0&1&-1&-1\\ 0&1&0&1&-3&-1\\ 0&0&0&0&0&0 \end{array}\right]$$.
Now, we do $r_1+r_3$, then swap $r_2$ and $r_3$, to give $$\left[ \begin{array}{ccccc|c} 1&0&0&1&-2&2\\ 0&1&0&1&-3&-1\\ 0&0&0&1&-1&-1\\ 0&0&0&0&0&0 \end{array}\right]$$.
Finally, let's do $r_1-r_3$ and $r_2-r_3$, to obtain $$\left[ \begin{array}{ccccc|c} 1&0&0&0&-1&3\\ 0&1&0&0&-2&0\\ 0&0&0&1&-1&-1\\ 0&0&0&0&0&0 \end{array}\right]$$.
Let's turn these back into equations. We have $x_1-x_5=3$, $x_2-2x_5=0$ and $x_4-x_5=-1$. Thus we can take $x_3$ and $x_5$ to be arbitrary, and then the others are determined. This gives a solution set of $\{ (\alpha+3, 2\alpha, \beta, \alpha-1, \alpha)^{\mathsf{T}} : \alpha, \beta \in \mathbb{R} \}$.
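A quick NumPy sketch that plugs members of the claimed solution set into the system (we use the matrix from the first elimination step above, since row operations do not change the solution set):

```python
import numpy as np

M = np.array([[1, -1, 0, 0, 1, 3],
              [0, 2, 0, -3, -1, 3],
              [0, 1, 0, 1, -3, -1],
              [0, 1, 0, -2, 0, 2]], dtype=float)
A, b = M[:, :5], M[:, 5]

rng = np.random.default_rng(0)
for _ in range(5):
    alpha, beta = rng.normal(size=2)
    x = np.array([alpha + 3, 2 * alpha, beta, alpha - 1, alpha])
    assert np.allclose(A @ x, b)      # every sampled member solves the system
print("solution set verified at sampled points")
```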
We start with doing $r_3-r_1$, to give us $$\left[ \begin{array}{cccccc|c} 0&1&0&0&1&0&2\\ 0&0&0&1&1&0&-1\\ 0&0&0&0&-1&1&-1 \end{array}\right]$$.
Next, do $r_1+r_3$, $r_2+r_3$, and then multiply $r_3$ by $-1$. This gives $$\left[ \begin{array}{cccccc|c} 0&1&0&0&0&1&1\\ 0&0&0&1&0&1&-2\\ 0&0&0&0&1&-1&1 \end{array}\right]$$.
This corresponds to the equations $x_2+x_6=1$, $x_4+x_6=-2$, and $x_5-x_6 = 1$. Now we can take $x_1$, $x_3$ and $x_6$ to be arbitrary, giving a solution set of $\{ (\alpha, 1-\beta, \gamma, -2-\beta, 1+\beta, \beta)^{\mathsf{T}} : \alpha,\beta, \gamma \in \mathbb{R} \}$.
Here, we are looking for eigenvectors associated with the eigenvalue $12$. Indeed, we don't yet know that $12$ is an eigenvalue, but if we find ourselves unable to find such an eigenvector, then we can answer that the solution set will be empty.
We have that $A-12I = \begin{bmatrix}-6&4&3\\ 6&-12&9\\ 0&8&-12 \end{bmatrix}$, and observe that $(3,3,2)^{\mathsf{T}} \in \ker(A-12I)$, so this is an eigenvector (and $12$ is an eigenvalue indeed!). Finally, we want an eigenvector such that the sum of the components is $1$, so simply divide the eigenvector above by $3+3+2=8$.
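Here is a short NumPy check of this (a sketch, using only the matrix $A-12I$ written above):

```python
import numpy as np

A_minus_12I = np.array([[-6, 4, 3],
                        [6, -12, 9],
                        [0, 8, -12]], dtype=float)
v = np.array([3, 3, 2], dtype=float)

print(A_minus_12I @ v)   # [0. 0. 0.] -> v is in the kernel, so 12 is an eigenvalue
print(v / v.sum())       # the eigenvector rescaled so its components sum to 1
```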
We perform Gaussian elimination on $$\left[ \begin{array}{cccc|cccc} 1&&1&&1&&&\\ &1&1&&&1&&\\ 1&1&&1&&&1&\\ 1&1&1&&&&&1 \end{array}\right]$$, where a blank space denotes a 0.
Firstly, with $r_3-r_1$, $r_4-r_1$, $r_3-r_2$, and $r_4-r_2$, we have $$\left[ \begin{array}{cccc|cccc} 1&&1&&1&&&\\ &1&1&&&1&&\\ &&-2&1&-1&-1&1&\\ &&-1&&-1&-1&&1 \end{array}\right]$$.
Then with $r_1+r_4$, $r_2+r_4$, and $r_3-2r_4$, we have $$\left[ \begin{array}{cccc|cccc} 1&&&&&-1&&1\\ &1&&&-1&&&1\\ &&&1&1&1&1&-2\\ &&-1&&-1&-1&&1 \end{array}\right]$$.
Finally, swapping $r_3$ and $r_4$, then multiplying $r_3$ by $-1$, we have $$\left[ \begin{array}{cccc|cccc} 1&&&&&-1&&1\\ &1&&&-1&&&1\\ &&1&&1&1&&-1\\ &&&1&1&1&1&-2 \end{array}\right]$$.
The matrix to the right of the vertical line is the inverse of $A$.
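As a sketch, the result can be checked with NumPy:

```python
import numpy as np

A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
A_inv = np.array([[0, -1, 0, 1],
                  [-1, 0, 0, 1],
                  [1, 1, 0, -1],
                  [1, 1, 1, -2]], dtype=float)

assert np.allclose(A @ A_inv, np.eye(4))       # the computed matrix inverts A
assert np.allclose(A_inv, np.linalg.inv(A))    # and agrees with NumPy's inverse
```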
Let's form the equation $\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 = 0$. These three vectors are linearly independent if and only if the \emph{only} solution to this equation is $\alpha_1=\alpha_2=\alpha_3=0$.
Looking at the third component, we have that $\alpha_1\cdot 1 + \alpha_2 \cdot 0 + \alpha_3 \cdot 0 = 0$, that is to say, $\alpha_1 = 0$.
Next, look at the second component. We already know $\alpha_1 = 0$, so we have $\alpha_2 \cdot 1 + \alpha_3 \cdot 0 = 0$, that is to say $\alpha_2=0$ also.
Finally, look at the first component. We have that $\alpha_3 \cdot 1 = 0$, so all of the $\alpha_i$'s are zero. Therefore, our three vectors are linearly independent.
Here, we need to solve the simultaneous equations $\alpha_1+\alpha_2+2\alpha_3 = 1$, $\alpha_1+2\alpha_2-\alpha_3=-2$, and $\alpha_1+3\alpha_2+\alpha_3 = 5$.
Performing Gaussian elimination in the usual way, we determine that $\alpha_1 = -6$, $\alpha_2 = 3$, and $\alpha_3 = 2$. That is to say, $y=-6x_1+3x_2+2x_3$.
We write the given vectors as $v_1, \dots v_6$ from left to right. Firstly, observe that $\dim(U_1)=2$ and $\dim(U_2)=2$ (compute the rank of $\left[v_1|v_2|v_3\right]$, then $\left[v_4|v_5|v_6\right]$). Since we can write $v_3= \frac13 (v_1-2v_2)$ and $v_6=-v_4-2v_5$, we need not consider $v_3$ and $v_6$ any further.
Now, if we find the rank of $\left[v_1|v_2|v_4|v_5\right]$, we get 3, so $\dim(U_1+U_2)=3$. Therefore, $\dim(U_1\cap U_2) = 2+2-3=1$. Hence, to find a basis of $U_1 \cap U_2$, we need only find any non-zero vector in the space.
Let $0\neq v \in U_1\cap U_2$. Then we can write $v=\alpha_1v_1 + \alpha_2v_2$, and $v=\alpha_4v_4 + \alpha_5v_5$. Subtracting these equations, we have $0=\alpha_1v_1 + \alpha_2v_2 -\alpha_4v_4 - \alpha_5v_5$. Remember we want a non-zero solution for $v$, and observe that the rank of $\left[v_1|v_2|v_4\right]$ is 3 (i.e. these three vectors are linearly independent). Hence we can take $\alpha_5=9$, say (this means we don't have fractions later, but there's no way to know this a priori!), and solve for the other $\alpha_i$'s. Using Gaussian elimination, we obtain $\alpha_1=4$, $\alpha_2=10$ and $\alpha_4 = -6$. Thus \[ v=4v_1+10v_2 = -6v_4+9v_5 = \begin{bmatrix} 24\\-6\\-12\\-6 \end{bmatrix}. \]
Finally, we can write our basis of $U_1\cap U_2$ as just $\{ v \}$.
Note that $rank([A_1|A_2])=3$, i.e. $\dim(U_1+U_2)=3$ (here $[A_1|A_2]$ is just the $4\times 6$ matrix formed by concatenating $A_1$ and $A_2$). This means that $\dim (U_1\cap U_2) = \dim(U_1)+\dim(U_2)-\dim(U_1+U_2) = 2+2-3=1$, so again, to find a basis of $U_1\cap U_2$, we need only find a non-zero vector in the space. We proceed in a similar way to Question 2.12.
Firstly, we observe that we can write $v_3=v_1+v_2$ and $v_6=v_4+v_5$, so these two vectors can safely be ignored. Secondly, observe that $\left[v_1|v_2|v_5\right]$ has rank three, so (using the notation of Question 12) if we take $\alpha_4=1$, say, and solve then we have $\alpha_1=3$, $\alpha_2=1$, and $\alpha_5=0$. In other words, our non-zero vector is $3v_1+v_2 = v_4 = (3,1,7,3)^{\mathsf{T}}$, and our basis of $U_1\cap U_2$ is $\{ (3,1,7,3)^{\mathsf{T}} \}$.
Doing the same dimensional analysis as the previous three questions, we find that $F\cap G$ has dimension 1. We have $F=span[(1,0,1), (0,1,1)]$, and $G=span[ (1,1,1),(-1,1,-3)]$.
Proceeding in the same way as before, we find that $-4v_1-2v_2 = -3v_3+v_4$, and hence $\{(-4,-2,-6)\}$ is a basis of $F \cap G$, which agrees with Part b.
From the coefficients on the right, we have $A_\Phi = \begin{bmatrix} 3&2&1\\ 1&1&1\\ 1&-3&0\\ 2&3&1 \end{bmatrix}$.
Using Gaussian elimination, we can compute that $rank(A_\Phi) = 3$. From this we deduce that the kernel is trivial (i.e. only $(0,0,0)$), and clearly $Im(\Phi)= \{ (3x_1+2x_2+x_3,x_1+x_2+x_3,x_1-3x_2,2x_1+3x_2+x_3)^{\mathsf{T}} : x_1, x_2, x_3 \in \mathbb{R} \}$. We have $\dim(\ker(\Phi))=0$, and $\dim(Im(\Phi))=rank(A_\Phi)=3$.
Let $P$ be the change of basis matrix from the basis $B$ to the standard basis of $\mathbb{R}^3$, i.e. the matrix whose columns are the vectors of $B$ written in standard coordinates. Then $P=\begin{bmatrix} 1&1&1\\ 1&2&0\\ 1&1&0 \end{bmatrix}$.
The matrix $\overline{A_\Phi}$ is given by $P^{-1}A_\Phi P = \begin{bmatrix} 6&9&1\\-3&-5&0\\-1&-1&0 \end{bmatrix}$.
Let $M=[c_1|c_2|c_3]$, and observe that $\det M = 4 \neq 0$, so the vectors are linearly independent. Since $\mathbb{R}^3$ has dimension 3, and we have three linearly independent vectors, $C$ must indeed be a basis.
Indeed, such an $M$ is the change of basis matrix from $C$ to $C'$ (write the old vectors in terms of the new!) and this is thus the $P_2$ we require. Thus $P_2 = \begin{bmatrix} 1&0&1\\ 2&-1&0\\-1&2&-1 \end{bmatrix}$.
We first need to apply $P_1$ to change from basis $B'$ to $B$. Then $A_\Phi$ will map us to $(\mathbb{R}^3, C)$, before $P_2$ will take us to $C'$. Remember that matrices are acting like functions here, so they are applied to (column) vectors from right to left. Therefore the multiplication we require is $A'=P_2 A_\Phi P_1$. (This is what part f is asking us to recognise.)
We have $A' = \begin{bmatrix} 0&2\\-10&3\\ 12&-4 \end{bmatrix}$.
$P_1 \begin{bmatrix}2\\3\end{bmatrix}=\begin{bmatrix}8\\9\end{bmatrix}$.
$A_\Phi \begin{bmatrix}8\\9\end{bmatrix}=\begin{bmatrix}-1\\9\\7\end{bmatrix}$.
$P_2 \begin{bmatrix}-1\\9\\7\end{bmatrix}=\begin{bmatrix}6\\-11\\12\end{bmatrix}$.
And observe that $A' \begin{bmatrix}2\\3\end{bmatrix}=\begin{bmatrix}6\\-11\\12\end{bmatrix}$, indeed!
Analytic Geometry
We must show that $\langle\cdot,\cdot\rangle$ is a bilinear, symmetric, positive definite map.
It is a calculation to show that $\langle \alpha x + \beta x', y \rangle = \alpha \langle x,y\rangle + \beta \langle x',y\rangle$, and similarly for the second argument, and so $\langle\cdot,\cdot\rangle$ is bilinear.
We have $\langle y,x\rangle = y_1x_1 - (y_1x_2+y_2x_1)+2y_2x_2 = \langle x,y\rangle$ indeed, so we have symmetry.
Finally, $\langle x,x \rangle = x_1^2 - 2x_1x_2+2x_2^2 = (x_1-x_2)^2 + x_2^2$. Clearly, if $x=0$, then $\langle x,x \rangle = 0$. If $x\neq 0$, then either $x_2 \neq 0$, in which case $\langle x,x \rangle > 0$; or $x_2=0$ with $x_1 \neq 0$, so $\langle x,x \rangle > 0$ again (since it's the sum of two squares of real numbers). Therefore $\langle \cdot , \cdot \rangle$ is an inner product indeed!
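Equivalently, $\langle x,y\rangle = x^{\mathsf{T}} M y$ with $M = \begin{bmatrix}1&-1\\-1&2\end{bmatrix}$, so symmetry and positive definiteness can also be read off from $M$; a small NumPy sketch:

```python
import numpy as np

M = np.array([[1, -1],
              [-1, 2]], dtype=float)   # <x, y> = x^T M y for the form above

assert np.allclose(M, M.T)             # symmetric
print(np.linalg.eigvalsh(M))           # both eigenvalues positive -> positive definite
```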
Let $v_1, \dots, v_4$ be the four vectors defined in the question. Observe that $rank[v_1|v_2|v_3|v_4]=3$. Moreover, $rank[v_1|v_2|v_3]=3$, so these three vectors form a basis of $U$. Let $B$ be this matrix of basis vectors, i.e. $B=[v_1|v_2|v_3]$.
Now we compute $B^{\mathsf{T}}B = \begin{bmatrix} 9&9&0\\ 9&16&-14\\ 0&-14&31 \end{bmatrix}$ and $B^{\mathsf{T}}x = \begin{bmatrix} 9\\23\\-25 \end{bmatrix}$.
Next, we solve $B^{\mathsf{T}}B\lambda = B^{\mathsf{T}}x$ for $\lambda$. Using Gaussian elimination, we obtain $\lambda = \begin{bmatrix}-3\\ 4\\ 1 \end{bmatrix}$.
Finally, $\pi_U(x)$ is given by $B\lambda = \begin{bmatrix} 1\\-5\\-1\\-2\\3 \end{bmatrix}$.
By construction, $d(x,U) = d(x,\pi_U(x))$. This is given by $\lVert x-\pi_U(x) \rVert = \begin{Vmatrix}-2\\-4\\0\\6\\-2 \end{Vmatrix}$.
This norm is given by the square root of the dot product of the vector with itself, so $d(x,U) = \sqrt{60}=2\sqrt{15}$.
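A NumPy sketch re-solving the normal equations from the quantities computed above:

```python
import numpy as np

BtB = np.array([[9, 9, 0],
                [9, 16, -14],
                [0, -14, 31]], dtype=float)
Btx = np.array([9, 23, -25], dtype=float)

lam = np.linalg.solve(BtB, Btx)
print(lam)                                            # [-3.  4.  1.]

residual = np.array([-2, -4, 0, 6, -2], dtype=float)  # x - pi_U(x) from above
print(np.linalg.norm(residual), np.sqrt(60))          # both ≈ 7.746
```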
We require $\langle e_1, e_2-\pi_U(e_2)\rangle = 0$ and $\langle e_3, e_2-\pi_U(e_2)\rangle = 0$. That is to say, $\left\langle \begin{bmatrix} 1\\0\\0 \end{bmatrix}, \begin{bmatrix}-\lambda_1\\1\\ -\lambda_2 \end{bmatrix} \right\rangle = 0$, and $\left\langle \begin{bmatrix} 0\\0\\1 \end{bmatrix}, \begin{bmatrix}-\lambda_1\\1\\ -\lambda_2 \end{bmatrix} \right\rangle = 0$.
Computing the first gives us $\lambda_1=\frac12$, while the second gives $\lambda_2 = -\frac12$.
Therefore, $\pi_U(e_2) = \begin{bmatrix} \frac12 \\ 0\\ -\frac12 \end{bmatrix}$.
We have $Im(id-\pi) = \{(id-\pi)(x):x\in V\} = \{x-\pi(x):x\in V\}$. Observe that $\pi(x-\pi(x)) = \pi(x)-\pi^2(x) = 0$, since $\pi$ is a projection. Thus, $Im(id-\pi)\subseteq\ker\pi$.
Now, suppose $k\in \ker \pi$. Then $k-\pi(k)=k$, so $k\in Im(id-\pi)$. Thus $\ker\pi\subseteq Im(id-\pi)$. Therefore, we have that $Im(id-\pi) = \ker\pi$. $\lozenge$
We have that $\ker(id-\pi) = \{x\in V : (id-\pi)(x) = 0\} = \{x\in V : x=\pi(x)\}$. Clearly, $\ker(id-\pi)\subseteq Im\pi$.
Take $x\in Im\pi$. Then there exists some $y\in V$ such that $\pi(y)=x$. Observe that $(id-\pi)(x) = x-\pi(x) = \pi(y)-\pi^2(y) = 0$, since $\pi$ is a projection. Hence $Im\pi\subseteq\ker(id-\pi)$. Therefore we have that $\ker(id-\pi) = Im\pi$.
I will assume we use the standard dot product as our inner product. Firstly, let's get an orthogonal basis, then we can simply divide by the magnitude of each vector to get our orthonormal basis.
Since $b_1$ is our first vector, there is nothing to do for now. To get the second vector in our basis, we need a vector, $b_2'$, perpendicular to $b_1$, such that $span(b_1,b_2) = span (b_1, b_2')$. This is given by $b_2' = b_2 - \pi_{span(b_1)}(b_2)$.
Now, $\pi_{span(b_1)}(b_2) = \frac{b_1b_1^\mathsf{T}}{\lVert b_1 \rVert^2} b_2 = \frac{1}{3} \begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}\begin{bmatrix}-1\\2\\0\end{bmatrix} = \frac{1}{3}\begin{bmatrix}1\\1\\1\end{bmatrix}$.
Therefore, $b_2' = \begin{bmatrix}-\frac43 \\ \frac53 \\ -\frac13 \end{bmatrix}$.
Hence, our orthonormal basis, $C$, is given by $C=\left\{\frac{1}{\sqrt3}\begin{bmatrix} 1\\1\\1 \end{bmatrix} , \frac{1}{\sqrt{42}}\begin{bmatrix}-4\\5\\-1 \end{bmatrix}\right\}$.
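A short NumPy sketch of the same Gram-Schmidt computation (standard dot product):

```python
import numpy as np

b1 = np.array([1, 1, 1], dtype=float)
b2 = np.array([-1, 2, 0], dtype=float)

c1 = b1 / np.linalg.norm(b1)
b2_orth = b2 - (b1 @ b2) / (b1 @ b1) * b1   # remove the component along b1
c2 = b2_orth / np.linalg.norm(b2_orth)

print(c1 * np.sqrt(3))    # [1. 1. 1.]
print(c2 * np.sqrt(42))   # [-4.  5. -1.]
print(c1 @ c2)            # ≈ 0, so the basis is orthonormal
```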
I will assume we need to rotate these vectors anticlockwise. A rotation about an angle $\theta$ is given by the matrix $\begin{bmatrix} \cos\theta&-\sin\theta\\ \sin\theta&\cos\theta \end{bmatrix}$. Thus for $\theta=30^\circ$, we have $R=\begin{bmatrix} \frac{\sqrt{3}}{2}&-\frac{1}{2}\\ \frac12 & \frac{\sqrt3}{2} \end{bmatrix} = \frac12 \begin{bmatrix} \sqrt3 & -1\\ 1&\sqrt3 \end{bmatrix}$.
Therefore, $Rx_1 = \frac12 \begin{bmatrix} 2\sqrt3-3\\2+3\sqrt3 \end{bmatrix}$ and $Rx_2 = \frac12 \begin{bmatrix} 1\\-\sqrt3 \end{bmatrix}$.
Matrix Decompositions
Perform Gaussian elimination to "fix" the first two columns. I have used only the rule allowing me to add a multiple of a row to a different row, which doesn't change the determinant. We have $\begin{bmatrix} 2&0&1&2&0\\ 0&-1&-1&-1&1\\ 0&0&1&0&3\\ 0&0&3&1&2\\ 0&0&-1&-1&1 \end{bmatrix}$.
From here, we can either continue with Gaussian elimination to get our matrix into upper triangular form, then multiply the entries on the diagonal together (remembering to take into account any elementary operations which would change the determinant!), or we can simply compute the determinant of the lower-right $3\times 3$ matrix, since this is quick to do by hand (it is $-3$). Thus the determinant of the overall matrix is $2\cdot -1\cdot -3 = 6$.
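A quick NumPy check (a sketch): the row operations used above leave the determinant unchanged, so the partially reduced matrix must also have determinant 6.

```python
import numpy as np

M = np.array([[2, 0, 1, 2, 0],
              [0, -1, -1, -1, 1],
              [0, 0, 1, 0, 3],
              [0, 0, 3, 1, 2],
              [0, 0, -1, -1, 1]], dtype=float)

print(np.linalg.det(M))   # ≈ 6.0
```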
If we solve the equation $\det(A-\lambda I) = 0$ for $\lambda$, we obtain $\lambda = 1$ only. (Or, indeed, we can observe that $A$ is in lower triangular form, so the eigenvalues are the entries on the main diagonal.)
The space $E_1 = \{x\in \mathbb{R}^2 : (A-I)(x)=0\} = span\{(0,1)\}$.
If we take $\det(A-\lambda I)$ and simplify, we have $(\lambda-2)(\lambda-1)(\lambda+1)^2$, so we have three eigenvalues. For $\lambda=2$ and $\lambda=1$, our eigenspace will certainly have dimension 1. For $\lambda = -1$, it could (at this stage!) have dimension 1 or 2.
Observe that $(1,0,1,1)$ is an eigenvector with eigenvalue 2, and $(1,1,1,1)$ is an eigenvector with eigenvalue 1. Thus $E_2=span\{(1,0,1,1)\}$ and $E_1 = span\{(1,1,1,1)\}$.
Now observe that $rank(A+I)=3$, so there can only be one linearly independent eigenvector with eigenvalue -1. Note that $(0,1,1,0)$ will do. Hence $E_{-1} = span\{(0,1,1,0)\}$.
Only the first two are diagonalisable – they are diagonal, so there is nothing to prove! The other two, however, are not diagonalisable – each only has one linearly independent eigenvector, whereas we need there to exist a basis of eigenvectors to yield diagonalisability.
Only the first and third matrices are invertible – the determinants are non-zero, while the other two matrices have determinant zero.
This tells us that diagonalisability is indeed unrelated to invertibility!
The characteristic polynomial is $(5-\lambda)(1-\lambda)^2$. Now, $rank(A-I) = 2$, so there is only one linearly independent eigenvector for $\lambda=1$. Hence $A$ is not diagonalisable.
We have $E_5 = span\{(1,1,0)\}$, and $E_1=span\{(-3,1,0)\}$.
The characteristic polynomial here is $-\lambda^3(1-\lambda)$, so the eigenvalues are 1, and 0 (with multiplicity 3). Observe that $rank(A-0I) = rank A = 1$, so there are three linearly independent eigenvectors for the eigenvalue 0. With the other eigenvector for $\lambda=1$, we will have a basis of eigenvectors, and hence $A$ will be diagonalisable.
We have $E_1 = span\{(1,0,0,0)\}$, and $E_0 = span\{(1,-1,0,0),(0,0,1,0),(0,0,0,1)\}$.
First, we compute $A^{\mathsf{T}}A = \begin{bmatrix} 13&12&2\\ 12&13&-2\\ 2&-2&8 \end{bmatrix}$. We can diagonalise this to find $D = \begin{bmatrix} 25&0&0\\ 0&9&0\\ 0&0&0 \end{bmatrix}$ and $P = \begin{bmatrix} \frac{1}{\sqrt2}&\frac{1}{\sqrt{18}}&-\frac23\\ \frac{1}{\sqrt2}&-\frac{1}{\sqrt{18}}&\frac23\\ 0&\frac{4}{\sqrt{18}}&\frac13 \end{bmatrix}$.
We take the square roots of the entries of $D$ to find $\Sigma = \begin{bmatrix} 5&0&0\\ 0&3&0 \end{bmatrix}$, with $V$ equalling our $P$ above.
From here, we compute $u_1 = \frac1{\sigma_1}Av_1 = \frac15\begin{bmatrix}3&2&2\\2&3&-2\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\\0\end{bmatrix} =\begin{bmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\end{bmatrix}$ , and $u_2 = \frac{1}{\sigma_2} A v_2 = \frac13 \begin{bmatrix}3&2&2\\2&3&-2\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt{18}}\\-\frac{1}{\sqrt{18}}\\\frac{4}{\sqrt{18}}\end{bmatrix} =\begin{bmatrix}\frac{1}{\sqrt2}\\-\frac{1}{\sqrt2}\end{bmatrix}$, giving $U=\frac{1}{\sqrt2}\begin{bmatrix}1&1\\1&-1\end{bmatrix}$.
It can be checked that $A = U\Sigma V^{\mathsf{T}}$, indeed!
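Here is that check as a NumPy sketch, which also confirms the singular values:

```python
import numpy as np

A = np.array([[3, 2, 2],
              [2, 3, -2]], dtype=float)
U = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)
Sigma = np.array([[5, 0, 0],
                  [0, 3, 0]], dtype=float)
V = np.array([[1/np.sqrt(2),  1/np.sqrt(18), -2/3],
              [1/np.sqrt(2), -1/np.sqrt(18),  2/3],
              [0,             4/np.sqrt(18),  1/3]])

assert np.allclose(U @ Sigma @ V.T, A)        # A = U Sigma V^T indeed
print(np.linalg.svd(A, compute_uv=False))     # singular values [5. 3.]
```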
Observe that the eigenvalues of $A$ are complex, so we cannot simply find the eigendecomposition. Proceeding as in the previous question, we have $A^{\mathsf{T}}A = \begin{bmatrix}5&3\\3&5\end{bmatrix}$, which when we perform the eigendecomposition on this new matrix, we obtain $D = \begin{bmatrix}8&0\\0&2\end{bmatrix}$ and $P=\frac{1}{\sqrt2}\begin{bmatrix}1&-1\\1&1\end{bmatrix}$. This $P$ is again our required $V$, and we have $\Sigma = \begin{bmatrix}2\sqrt2 & 0\\0&\sqrt2\end{bmatrix}$.
As before, we can now compute
$u_1 = \frac{1}{2\sqrt2}\begin{bmatrix}2&2\\-1&1\end{bmatrix}\begin{pmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\end{pmatrix} = \begin{bmatrix}1\\0\end{bmatrix}$, and similarly $u_2 = \begin{bmatrix}0\\1\end{bmatrix}$. Hence, our $U$ turns out to be the identity matrix.
Using our Singular value decomposition from Question 8, we construct $A_1 = \sigma_1 u_1 v_1^{\mathsf{T}} = 5 \begin{bmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt2}&\frac{1}{\sqrt2}&0\end{bmatrix} =\frac52 \begin{bmatrix}1&1&0\\1&1&0\end{bmatrix}$, and similarly $A_2 = \frac12 \begin{bmatrix}1&-1&4\\-1&1&-4\end{bmatrix}$.
Then $A_1$ and $A_2$ both have rank one, with $A=A_1+A_2$, as required.
We need only show this in one direction, since we can replace $A$ with $A^{\mathsf{T}}$ below, and the argument still holds true.
Suppose $\lambda\neq 0$ is an eigenvalue of $A^{\mathsf{T}}A$. Then there is some vector $x\neq 0$ such that $A^{\mathsf{T}}Ax = \lambda x$. Thus, $AA^{\mathsf{T}}Ax = \lambda Ax$. Therefore, $Ax$ is an eigenvector of $AA^{\mathsf{T}}$, with eigenvalue $\lambda$ (note that $Ax \neq 0$, since $A^{\mathsf{T}}Ax = \lambda x \neq 0$). Note also that $x$ itself is in general not an eigenvector of $AA^{\mathsf{T}}$; in fact, it cannot even be multiplied by $AA^{\mathsf{T}}$, because it is $n$-dimensional rather than $m$-dimensional.
Vector Calculus
Firstly, let's rewrite $f(x)$ as $f(x) = 4\log (x)\sin(x^3)$.
Then, $f'(x) = \frac4x\sin(x^3)+12x^2\log(x)\cos(x^3)$.
We compute the first five derivatives of our function at 0. We have $f(0)=f'(0)=1$, $f^{(2)}(0)=f^{(3)}(0)=-1$, and $f^{(4)}(0)=f^{(5)}(0)=1$.
The Taylor polynomial $T_5(x) = 1+x-\frac12 x^2 -\frac16 x^3 + \frac1{24} x^4 + \frac1{120}x^5$. The lower-order Taylor polynomials can be found by truncating this expression appropriately.
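As a sketch, assuming the function in question is $f(x)=\sin x+\cos x$ (its derivatives at $0$ reproduce the pattern $1,1,-1,-1,1,1$), SymPy recovers the same polynomial:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x) + sp.cos(x)            # assumed function; matches the derivatives above
T5 = sp.series(f, x, 0, 6).removeO()
print(sp.expand(T5))                 # agrees with T5(x) above, term by term
```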
We have $\frac{df}{dt} = \cos(\log(t^{\mathsf{T}}t))\cdot\frac1{t^{\mathsf{T}}t}\cdot \begin{bmatrix} 2t_1&2t_2&\cdots&2t_D \end{bmatrix}= \cos(\log(t^{\mathsf{T}}t))\cdot\frac{2t^{\mathsf{T}}}{t^{\mathsf{T}}t}$.
For $g$, if we explicitly compute $AXB$ and find its trace, we have that $g(X) = \sum_{k=1}^D \sum_{j=1}^F \sum_{i=1}^E a_{ki}x_{ij}b_{jk}$. Thus we have, $\frac{\partial g}{\partial x_{ij}} = \sum_{k=1}^D b_{jk}a_{ki}$, and this is the $(i,j)$-th entry of the required derivative.
If we look at the last equation, we can see that the $(i, j)$-th partial derivative is the product of $j$-th row of $B$ and $i$-th column of $A$. Therefore, the right side of the equation is $\left(BA\right)_{ji}$. Since the order of the indices is reversed, i.e., the $(i, j)$-th partial derivative is the $(j, i)$-th element of $BA$, $\frac{dg}{dX} = (BA)^{\mathsf{T}} = A^{\mathsf{T}}B^{\mathsf{T}}$.
Checking the dimensions, we can see that $\left(BA\right)^{\mathsf{T}} \in \mathbb{R}^{E \times F}$, which matches the dimensions of $X$.
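A finite-difference sketch of this result for random matrices (NumPy; the sizes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, F = 4, 3, 5                       # A: D x E, X: E x F, B: F x D
A = rng.normal(size=(D, E))
B = rng.normal(size=(F, D))
X = rng.normal(size=(E, F))

grad_analytic = A.T @ B.T               # claimed derivative of tr(AXB) w.r.t. X
grad_numeric = np.zeros((E, F))
eps = 1e-6
for i in range(E):
    for j in range(F):
        dX = np.zeros((E, F))
        dX[i, j] = eps
        grad_numeric[i, j] = (np.trace(A @ (X + dX) @ B) - np.trace(A @ X @ B)) / eps

print(np.max(np.abs(grad_analytic - grad_numeric)))   # ≈ 0
```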
The chain rule tells us that $\frac{df}{dx} = \frac{df}{dz}\frac{dz}{dx}$, where $\frac{df}{dz}$ has dimension $1\times 1$, and $\frac{dz}{dx}$ has dimension $1\times D$. We know $\frac{dz}{dx}=2x^{\mathsf{T}}$ from $f$ in Question 6. Also, $\frac{df}{dz} = \frac{1}{1+z}$.
Therefore, $\frac{df}{dx} = \frac{2x^{\mathsf{T}}}{1+x^{\mathsf{T}}x}$.
Here we have $\frac{df}{dz}$ is an $E\times E$ matrix, namely $\begin{bmatrix} \cos z_1 & 0 & \cdots & 0\\ 0 & \cos z_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0&\cdots&\cos z_E \end{bmatrix}$.
Also, $\frac{dz}{dx}$ is an $E\times D$-dimensional matrix, namely $A$ itself.
The overall derivative is obtained by multiplying these two matrices together, which will again give us an $E\times D$-dimensional matrix.
We have $\frac{df}{dz}$ has dimension $1\times 1$, and is simply $-\frac12 \exp(-\frac12 z)$.
Now, $\frac{dz}{dy}$ has dimension $1\times D$, and is given by $y^{\mathsf{T}}(S^{-1}+(S^{-1})^{\mathsf{T}})$.
Finally, $\frac{dy}{dx}$ has dimension $D\times D$, and is just the identity matrix.
Again, we multiply these all together to get our final derivative.
If we explicitly write out $xx^{\mathsf{T}}+\sigma^2I$, and compute its trace, we find that $f(x) = x_1^2 + \dots + x_n^2 + n\sigma^2$.
Hence, $\frac{df}{dx} = 2x^{\mathsf{T}}$.
Here, $\frac{df}{dz} = \begin{bmatrix} \frac{1}{\cosh^2z_1}&0&\cdots&0\\ 0&\frac{1}{\cosh^2z_2}&\cdots&0\\ \vdots & \vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{1}{\cosh^2z_M} \end{bmatrix}$, while $\frac{dz}{dx} = A$, as in Question 7b.
Finally, $\frac{df}{dx}$ is given by the product of these two matrices.
Piecing this together, replacing $z$ with $t(\epsilon,\nu)$ throughout, we have that $g(\nu) = \log(p(x,t(\epsilon,\nu))) - \log(q(t(\epsilon,\nu),\nu))$.
Therefore, $\frac{dg}{d\nu} = \frac{p'(x,t(\epsilon,\nu))\cdot t'(\epsilon,\nu) }{p(x,t(\epsilon,\nu))} - \frac{q'(t(\epsilon,\nu),\nu)\cdot t'(\epsilon,\nu)}{q(t(\epsilon,\nu),\nu)} $.
Probability and Distributions
The marginal distributions are obtained by summing the probabilities over all the values of the variable being marginalized. Thus, to obtain $p(x)$ we sum each column of the table (i.e., we sum over the values corresponding to different $y$):
\begin{aligned} p(x_1) &= P(X = x_1) = P(X = x_1, Y = y_1) + P(X = x_1, Y = y_2) + P(X = x_1, Y = y_3) = 0.01 + 0.05 + 0.1 = 0.16 \\ p(x_2) &= P(X = x_2) = P(X = x_2, Y = y_1) + P(X = x_2, Y = y_2) + P(X = x_2, Y = y_3) = 0.02 + 0.1 + 0.05 = 0.17\\ p(x_3) &= P(X = x_3) = P(X = x_3, Y = y_1) + P(X = x_3, Y = y_2) + P(X = x_3, Y = y_3) = 0.03 + 0.05 + 0.03 = 0.11\\ p(x_4) &= P(X = x_4) = P(X = x_4, Y = y_1) + P(X = x_4, Y = y_2) + P(X = x_4, Y = y_3) = 0.1 + 0.07 + 0.05 = 0.22\\ p(x_5) &= P(X = x_5) = P(X = x_5, Y = y_1) + P(X = x_5, Y = y_2) + P(X = x_5, Y = y_3) = 0.1 + 0.2 + 0.04 = 0.34 \end{aligned}As a correctness check, note that this distribution satisfies the normalization condition, i.e. that sum of the probabilities is $1$:
\begin{equation} \sum_{i=1}^5 p(x_i) = 1 \end{equation}The marginal distribution $p(y)$ can be obtained in a similar way, by summing the matrix rows:
\begin{aligned} p(y_1) &= P(Y = y_1) = \sum_{i=1}^5 P(X = x_i, Y = y_1) = 0.01 + 0.02 + 0.03 + 0.1 + 0.1 = 0.26 \\ p(y_2) &= P(Y = y_2) = \sum_{i=1}^5 P(X = x_i, Y = y_2) = 0.05 + 0.1 + 0.05 + 0.07 + 0.2 = 0.47 \\ p(y_3) &= P(Y = y_3) = \sum_{i=1}^5 P(X = x_i, Y = y_3) = 0.1 + 0.05 + 0.03 + 0.05 + 0.04 = 0.27 \end{aligned}We can again check that the normalization condition is satisfied:
\begin{equation} \sum_{i=1}^3p(y_i) = 1 \end{equation}To determine conditional distributions we use the definition of the conditional probability:
\[P(X = x , Y = y_1) = P(X = x | Y = y_1)P(Y = y_1) = p(x | Y = y_1) p(y_1).\]
Thus,
\begin{align*} p(x_1 | Y = y_1) &= \frac{P(X = x_1, Y = y_1)}{p(y_1)} = \frac{0.01}{0.26} \approx 0.038\\ p(x_2 | Y = y_1) &= \frac{P(X = x_2, Y = y_1)}{p(y_1)} = \frac{0.02}{0.26} \approx 0.077\\ p(x_3 | Y = y_1) &= \frac{P(X = x_3, Y = y_1)}{p(y_1)} = \frac{0.03}{0.26} \approx 0.115\\ p(x_4 | Y = y_1) &= \frac{P(X = x_4, Y = y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385\\ p(x_5 | Y = y_1) &= \frac{P(X = x_5, Y = y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385 \end{align*}Likewise the conditional distribution $p(y | X = x_3)$ is given by
\begin{align*} p(y_1 | X = x_3) &= \frac{P(X = x_3, Y = y_1)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273\\ p(y_2 | X = x_3) &= \frac{P(X = x_3, Y = y_2)}{p(x_3)} = \frac{0.05}{0.11} \approx 0.455\\ p(y_3 | X = x_3) &= \frac{P(X = x_3, Y = y_3)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273 \end{align*}
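The same marginals and conditionals can be reproduced with a few lines of NumPy (a sketch; the table is stored with $y$ as rows and $x$ as columns):

```python
import numpy as np

# joint table P(X = x_j, Y = y_i): rows y1..y3, columns x1..x5
P = np.array([[0.01, 0.02, 0.03, 0.10, 0.10],
              [0.05, 0.10, 0.05, 0.07, 0.20],
              [0.10, 0.05, 0.03, 0.05, 0.04]])

p_x = P.sum(axis=0)        # [0.16 0.17 0.11 0.22 0.34]
p_y = P.sum(axis=1)        # [0.26 0.47 0.27]
print(p_x, p_x.sum())      # marginal p(x), sums to 1
print(p_y, p_y.sum())      # marginal p(y), sums to 1

print(P[0] / p_y[0])       # p(x | Y = y1)
print(P[:, 2] / p_x[2])    # p(y | X = x3)
```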
We can write the probability density of the two-dimensional distribution as
\[p(x,y)= 0.4\mathcal{N}\left(x, y|\begin{bmatrix} 10\\ 2\end{bmatrix}, \begin{bmatrix} 1&0\\0&1\end{bmatrix}\right)+ 0.6\mathcal{N}\left(x, y|\begin{bmatrix} 0\\ 0\end{bmatrix}, \begin{bmatrix} 8.4&2.0\\2.0&1.7\end{bmatrix}\right)\]
The marginal distribution of a weighted sum of distributions is given by the weighted sum of the marginals, and the marginals of a bivariate normal distribution $\mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})$ are obtained according to the rule
\[\int \mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})dy= \mathcal{N}(x|\mu_x, \Sigma_{xx}), \\ \int \mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})dx = \mathcal{N}(y|\mu_y, \Sigma_{yy}) \]
Thus, the marginals of the distribution of interest are
\begin{gather*} p(x) = 0.4\mathcal{N}(x| 10, 1) + 0.6\mathcal{N}(x| 0, 8.4),\\ p(y) = 0.4\mathcal{N}(y| 2, 1) + 0.6\mathcal{N}(y| 0, 1.7) \end{gather*}The mean of a weighted sum of two distributions is the weighted sum of their means
\begin{gather*} \mathbb{E}_X[x] = 0.4\cdot 10 + 0.6\cdot 0 = 4,\\ \mathbb{E}_Y[y] = 0.4\cdot 2 + 0.6\cdot 0 = 0.8 \end{gather*}The mode of a continuous distribution is a point where this distribution has a peak. It can be determined by solving the extremum condition for each of the marginal distributions:
\begin{gather*} \frac{dp(x)}{dx} = 0,\\ \frac{dp(y)}{dy} = 0 \end{gather*}In the case of a mixture of normal distributions these equations are non-linear and can be solved only numerically. After finding all the solutions of these equations one has to verify for every solution that it is a peak rather than an inflection point, i.e. that at this point
\[\frac{d^2p(x)}{dx^2} < 0 \text{ for } p(x), \qquad \frac{d^2p(y)}{dy^2} < 0 \text{ for } p(y)\]
The medians $m_x, m_y$ can be determined from the conditions
\begin{gather*} \int_{-\infty}^{m}p(x)dx = \int^{+\infty}_{m}p(x)dx,\\ \int_{-\infty}^{m}p(y)dy = \int^{+\infty}_{m}p(y)dy \end{gather*}Again, these equations can be solved here only numerically.
The mean of a two-dimensional distribution is a vector of means of the marginal distributions
\[\mathbf{\mu} = \begin{bmatrix}4\\0.8\end{bmatrix}\]
The mode of a two-dimensional distribution is obtained by first solving the extremum conditions
\[\frac{\partial p(x,y)}{\partial x} = 0, \frac{\partial p(x,y)}{\partial y} = 0\]
and then verifying for every solution that it is indeed a peak, i.e.
\begin{gather*} \frac{\partial^2 p(x,y)}{\partial x^2} < 0, \frac{\partial^2 p(x,y)}{\partial y^2} < 0,\\ \det\left( \begin{bmatrix} \frac{\partial^2 p(x,y)}{\partial x^2} & \frac{\partial^2 p(x,y)}{\partial x\partial y}\\ \frac{\partial^2 p(x,y)}{\partial x\partial y} & \frac{\partial^2 p(x,y)}{\partial y^2} \end{bmatrix} \right) > 0 \end{gather*}Again, these equations can be solved only numerically.
The conjugate prior to the Bernoulli distribution is the Beta distribution
\[p(\mu | \alpha, \beta) =\frac{1}{\mathcal{B}(\alpha, \beta)} \mu^{\alpha -1}(1-\mu)^{\beta-1} \propto \mu^{\alpha -1}(1-\mu)^{\beta-1},\]
where $\alpha,\beta$ are not necessarily integers and the normalisation coefficient is the Beta function defined as
\[\mathcal{B}(\alpha, \beta) = \int_0^1 t^{\alpha -1}(1-t)^{\beta-1}dt\]
The likelihood of observing data $\{x_1, x_2, ..., x_N\}$ is
\[p(x_1, ..., x_N|\mu) = \prod_{i=1}^Np(x_i|\mu) = \prod_{i=1}^N \mu^{x_i}(1-\mu)^{1-x_i} = \mu^{\sum_{i=1}^N x_i}(1-\mu)^{N-\sum_{i=1}^N x_i}\]
The posterior distribution is proportional to the product of this likelihood with the prior distribution (Bayes theorem):
\[p(\mu |x_1, ..., x_N) \propto p(x_1, ..., x_N|\mu)p(\mu | \alpha, \beta) \propto \mu^{\sum_{i=1}^N x_i + \alpha -1}(1-\mu)^{N-\sum_{i=1}^N x_i +\beta -1}\]
This is also a Beta distribution, i.e. our choice of the conjugate prior was correct. The normalization constant is readily determined:
\[p(\mu |x_1, ..., x_N) = \frac{1}{\mathcal{B}(\sum_{i=1}^N x_i + \alpha, N-\sum_{i=1}^N x_i +\beta)} \mu^{\sum_{i=1}^N x_i + \alpha -1}(1-\mu)^{N-\sum_{i=1}^N x_i +\beta -1}\]
The probabilities of picking a mango or an apple from the first bag are given by
\begin{gather*} p(\text{mango} |1) = \frac{4}{6} = \frac{2}{3}\\ p(\text{apple} |1) = \frac{2}{6} = \frac{1}{3} \end{gather*}The probabilities of picking a mango or an apple from the second bag are
\begin{gather*} p(\text{mango} |2) = \frac{4}{8} = \frac{1}{2}\\ p(\text{apple} |2) = \frac{4}{8} = \frac{1}{2} \end{gather*}The probabilities of picking the first or the second bag are equal to the probabilities of heads and tails, respectively:
\begin{gather*} p(1) = 0.6,\\ p(2) = 0.4 \end{gather*}We can now obtain the probability that the mango was picked from the second bag using Bayes' theorem:
\[ p(2 | \text{mango}) = \frac{p(\text{mango} | 2)p(2)}{p(\text{mango})} = \frac{p(\text{mango} | 2)p(2)}{p(\text{mango} | 1)p(1) + p(\text{mango} | 2)p(2)} = \frac{\frac{1}{2}0.4}{\frac{2}{3}0.6 + \frac{1}{2}0.4} = \frac{1}{3} \]
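The same Bayes computation as a tiny Python sketch:

```python
p_mango_given_1, p_mango_given_2 = 2/3, 1/2   # likelihoods of drawing a mango
p_1, p_2 = 0.6, 0.4                           # prior probabilities of the bags

p_mango = p_mango_given_1 * p_1 + p_mango_given_2 * p_2
print(p_mango_given_2 * p_2 / p_mango)        # 0.333... = 1/3
```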
Let $\mathbf{z} = \mathbf{A}\mathbf{x}_{t}$. Since $\mathbf{x}_t \sim \mathcal{N}(\mathbf{\mu}_t,\mathbf{\Sigma})$ and $\mathbf{z}$ is a linear transformation of this Gaussian random variable, $\mathbf{z}$ is distributed as (see Eq. (6.88))
\[\mathbf{z} \sim \mathcal{N}(\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T),\]
and the mean and covariance of a sum of two independent Gaussian random variables are given by the sums of the means and the covariances of these variables, i.e.,
\[\mathbf{x}_{t+1} = \mathbf{z} + \mathbf{w} \sim \mathcal{N}(\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}),\]
That is
\[p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}). \]
If we assume that $\mathbf{x}_{t+1}$ is fixed, then $\mathbf{y}_{t+1} = \mathbf{C}\mathbf{x}_{t+1} + \mathbf{v}$ follows the same distribution as $\mathbf{v}$, but with the mean shifted by $\mathbf{C}\mathbf{x}_{t+1}$, i.e.
\[ p(\mathbf{y}_{t+1}|\mathbf{x}_{t+1}, \mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}). \]
The joint probability is then obtained as
\[ p(\mathbf{y}_{t+1}, \mathbf{x}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)= p(\mathbf{y}_{t+1}|\mathbf{x}_{t+1}, \mathbf{y}_1,...,\mathbf{y}_t) p(\mathbf{x}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}) \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}). \]
Let us introduce temporary notation
\[ \mathbf{\mu}_{t+1} = \mathbf{A}\mathbf{\mu}_t,\\ \mathbf{\Sigma}_{t+1} = \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q},\\ p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t) = \mathcal{N}(\mathbf{\mu}_{t+1}, \mathbf{\Sigma}_{t+1}) \]
The distribution of $\mathbf{y}_{t+1}$ is then obtained in terms of the parameters of $p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t)$, following the same steps as in question 1), with the result
\[ p(\mathbf{y}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{\mu}_{t+1}, \mathbf{C}\mathbf{\Sigma}_{t+1}\mathbf{C}^T + \mathbf{R})= \mathcal{N}\left(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{A}\mathbf{\mu}_t, \mathbf{C}(\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T+ \mathbf{Q})\mathbf{C}^T + \mathbf{R}\right). \]
The required conditional distribution is then obtained as
\[ p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t, \mathbf{y}_{t+1})= \frac{p(\mathbf{y}_{t+1}, \mathbf{x}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)} {p(\mathbf{y}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)}= \frac{\mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}) \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q})} {\mathcal{N}\left(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{A}\mathbf{\mu}_t, \mathbf{C}(\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q})\mathbf{C}^T + \mathbf{R}\right)} \]
The standard definition of variance is
\[ \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2], \]
where
\[ \mu = \mathbb{E}_X[x]. \]
Using the linearity of the expectation we can write:
\begin{gather*} \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] = \mathbb{E}_X[x^2 - 2x\mu +\mu^2] = \mathbb{E}_X[x^2] - \mathbb{E}_X[2x\mu] + \mathbb{E}_X[\mu^2]=\\ \mathbb{E}_X[x^2] - 2\mu\mathbb{E}_X[x] + \mu^2 = \mathbb{E}_X[x^2] - 2\mu^2 + \mu^2 = \mathbb{E}_X[x^2] - \mu^2 \end{gather*}Substituting the definition of $\mu$ into this equation, we obtain the desired equation
\[ \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 \]
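A short numerical sketch of the identity on a random sample (NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

lhs = np.mean((x - x.mean())**2)            # E[(x - mu)^2]
rhs = np.mean(x**2) - x.mean()**2           # E[x^2] - (E[x])^2
print(lhs, rhs)                             # identical up to rounding
```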
Let us expand the square in the left-hand side of (6.45)
\[ \frac{1}{N^2}\sum_{i,j=1}^N(x_i - x_j)^2 = \frac{1}{N^2}\sum_{i,j=1}^N(x_i^2 - 2x_i x_j + x_j^2) = \frac{1}{N^2}\sum_{i,j=1}^N x_i^2 - 2\frac{1}{N^2}\sum_{i,j=1}^N x_i x_j + \frac{1}{N^2}\sum_{i,j=1}^Nx_j^2 \]
We see that the first and the last term differ only by the summation index, i.e. they are identical: \[ \frac{1}{N^2}\sum_{i,j=1}^N x_i^2 + \frac{1}{N^2}\sum_{i,j=1}^Nx_j^2= 2\frac{1}{N^2}\sum_{i,j=1}^N x_i^2 = 2\frac{1}{N}\sum_{i=1}^N x_i^2, \]
since the summation over $j$ gives a factor of $N$.
The remaining term can be written as
\[ 2\frac{1}{N^2}\sum_{i,j=1}^N x_i x_j = 2\frac{1}{N^2}\sum_{i=1}^N x_i \sum_{j=1}^N x_j = 2\left(\frac{1}{N}\sum_{i=1}^N x_i\right)^2, \]
where we used the fact that the double sum factorises into the product of two single sums, which are equal since the name of the summation index is immaterial.
We have thus proved the required relation
\[ \frac{1}{N^2}\sum_{i,j=1}^N(x_i - x_j)^2 = 2\frac{1}{N}\sum_{i=1}^N x_i^2 - 2\left(\frac{1}{N}\sum_{i=1}^N x_i\right)^2 \]
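A short NumPy sketch checking the relation on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
N = len(x)

lhs = np.sum((x[:, None] - x[None, :])**2) / N**2
rhs = 2 * np.mean(x**2) - 2 * np.mean(x)**2
print(lhs, rhs)                             # equal up to rounding
```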
The Bernoulli distribution is given by
\[ p(x|\mu) = \mu^x (1-\mu)^{1-x} \]
We can use the relation
\[ a^x = e^{x\log a} \]
to write the Bernoulli distribution as
\[ p(x|\mu) = e^{x\log\mu + (1-x)\log(1-\mu)}= e^{x\log\left(\frac{\mu}{1-\mu}\right) + \log(1-\mu)} = h(x)e^{\theta x - A(\theta)}, \]
where the last equation is the definition of a single-parameter distribution from the exponential family, in which
\[ h(x) = 1,\\ \theta = \log\left(\frac{\mu}{1-\mu}\right) \leftrightarrow \mu = \frac{e^\theta}{1+e^\theta},\\ A(\theta) = -\log(1-\mu) = \log(1+e^\theta) \]
The binomial distribution can be transformed as
\[ p(x|N,\mu) = {N\choose x} \mu^x (1-\mu)^{N-x} = {N \choose x} e^{x\log\mu + (N-x)\log(1-\mu)}= {N \choose x}e^{x\log\left(\frac{\mu}{1-\mu}\right) +N\log(1-\mu)} = h(x)e^{x\theta - A(\theta)} \]
where
\[ h(x) = {N \choose x},\\ \theta = \log\left(\frac{\mu}{1-\mu}\right),\\ A(\theta) = -N\log(1-\mu) = N\log(1+e^\theta) \]
i.e., the binomial distribution can be represented as an exponential family distribution (only $\mu$ is treated as a parameter here, since the number of trials $N$ is fixed).
Similarly, the beta distribution can be transformed as
\[ p(x |\alpha, \beta) = \frac{1}{\mathcal{B}(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1} = e^{(\alpha -1)\log x + (\beta -1)\log(1-x) - \log(\mathcal{B}(\alpha,\beta))}= h(x)e^{\theta_1\phi_1(x) + \theta_2\phi_2(x) -A(\theta_1, \theta_2)} \]
where
\begin{gather*} h(x) = 1,\\ \theta_1 = \alpha-1, \theta_2 = \beta-1,\\ \phi_1(x) = \log x, \phi_2(x) = \log(1-x),\\ A(\theta_1, \theta_2) = \log(\mathcal{B}(\alpha,\beta)) = \log(\mathcal{B}(1+\theta_1,1 + \theta_2)) \end{gather*}i.e. this is a distribution from the exponential family.
The product of the two distributions is then given by
\begin{gather*} p(x|N,\mu) p(x|\alpha, \beta)={N \choose x}e^{x\log\left(\frac{\mu}{1-\mu}\right) + (\alpha-1)\log x + (\beta -1)\log(1-x) + N\log(1-\mu) - \log(\mathcal{B}(\alpha,\beta))}\\ = h(x) e^{\theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \theta_3\phi_3(x) - A(\theta_1, \theta_2, \theta_3)} \end{gather*}where
\begin{gather*} h(x) = {N \choose x},\\ \theta_1 = \alpha-1, \theta_2 = \beta-1,\theta_3 = \log\left(\frac{\mu}{1-\mu}\right)\\ \phi_1(x) = \log x, \phi_2(x) = \log(1-x), \phi_3(x) = x\\ A(\theta_1, \theta_2, \theta_3) = \log(\mathcal{B}(\alpha,\beta)) -N\log(1-\mu) = \log(\mathcal{B}(1+\theta_1,1 + \theta_2)) + N\log(1+e^{\theta_3}) \end{gather*}
The two normal distributions are given by
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) = (2\pi)^{-\frac{D}{2}}|\mathbf{A}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})\right],\\ \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-\frac{D}{2}}|\mathbf{B}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})\right] \end{gather*}their product is
\[ \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-D}|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left[(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})+(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})\right]\right\} \]
The expression in the exponent can be written as
\begin{gather*} \Phi = (\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})+(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})=\\ \mathbf{x}^T\mathbf{A}^{-1}\mathbf{x} - \mathbf{a}^T\mathbf{A}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{A}^{-1}\mathbf{a}+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a}+ \mathbf{x}^T\mathbf{B}^{-1}\mathbf{x} - \mathbf{b}^T\mathbf{B}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{B}^{-1}\mathbf{b}+ \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}=\\ \mathbf{x}^T(\mathbf{A}^{-1}+\mathbf{B}^{-1})\mathbf{x}- (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})\mathbf{x}- \mathbf{x}^T(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} \end{gather*}we now introduce notation
\begin{gather*} \mathbf{C}^{-1} = (\mathbf{A}^{-1}+\mathbf{B}^{-1}),\\ \mathbf{c} = \mathbf{C}(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b}),\\ \mathbf{c}^T = (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})\mathbf{C}\text{ (transpose the previous equation and use the symmetry of }\mathbf{C}\text{)} \end{gather*}The expression in the exponent now takes the form
\begin{gather*} \Phi= \mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{C}^{-1}\mathbf{c}+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}=\\ \mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{C}^{-1}\mathbf{c}+ \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} + \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}- \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c}=\\ (\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \end{gather*}where we have completed the square.
The product of the two probability distributions can be now written as
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-D}|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left[(\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\}=\\ (2\pi)^{-\frac{D}{2}}|\mathbf{C}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})\right] \times (2\pi)^{-\frac{D}{2}}\frac{|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}}}{|\mathbf{C}|^{-\frac{1}{2}}} \exp\left\{-\frac{1}{2}\left[ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\}=\\ c\,\mathcal{N}(\mathbf{x}|\mathbf{c}, \mathbf{C}), \end{gather*}where we defined
\[ c = (2\pi)^{-\frac{D}{2}}\frac{|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}}}{|\mathbf{C}|^{-\frac{1}{2}}} \exp\left\{-\frac{1}{2}\left[ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\} \]
We can now use the properties that (a) the determinant of a matrix product is the product of the determinants, and (b) the determinant of a matrix inverse is the inverse of the determinant of this matrix, and write
\[ \frac{|\mathbf{A}||\mathbf{B}|}{|\mathbf{C}|}= |\mathbf{A}||\mathbf{C}^{-1}||\mathbf{B}|= |\mathbf{A}\mathbf{C}^{-1}\mathbf{B}|= |\mathbf{A}(\mathbf{A}^{-1} + \mathbf{B}^{-1})\mathbf{B}|= |\mathbf{A} + \mathbf{B}| \]
For the expression in the exponent we can write
\begin{gather*} \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c}= \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}- (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} (\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})\\ =\mathbf{a}^T\left[\mathbf{A}^{-1} - \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}\right]\mathbf{a}+ \mathbf{b}^T\left[\mathbf{B}^{-1} - \mathbf{B}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{B}^{-1}\right]\mathbf{b}- \mathbf{a}^T\mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{B}^{-1}\mathbf{b}- \mathbf{b}^T\mathbf{B}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{A}^{-1}\mathbf{a} \end{gather*}Using the property $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ we obtain
\[ \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{B}^{-1}= \left[\mathbf{B}(\mathbf{A}^{-1} + \mathbf{B}^{-1})\mathbf{A}\right]^{-1}= (\mathbf{A} + \mathbf{B})^{-1} \]
and
\begin{gather*} \mathbf{A}^{-1} - \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}=\\ \mathbf{A}^{-1}\left[\mathbf{I} - (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}\right]=\\ \mathbf{A}^{-1}\left[\mathbf{I} - \mathbf{B}(\mathbf{A} + \mathbf{B})^{-1}\mathbf{A}\mathbf{A}^{-1}\right]=\\ \mathbf{A}^{-1}\left[\mathbf{I} - \mathbf{B}(\mathbf{A} + \mathbf{B})^{-1}\right]=\\ \mathbf{A}^{-1}\left[(\mathbf{A} + \mathbf{B}) - \mathbf{B}\right](\mathbf{A} + \mathbf{B})^{-1}=\\ (\mathbf{A} + \mathbf{B})^{-1} \end{gather*}We thus conclude that
\begin{gather*} c = (2\pi)^{-\frac{D}{2}}|\mathbf{A}+\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}( \mathbf{a} - \mathbf{b})^T(\mathbf{A} + \mathbf{B})^{-1}(\mathbf{a} -\mathbf{b})\right\}=\\ \mathcal{N}(\mathbf{b}|\mathbf{a}, \mathbf{A}+ \mathbf{B})=\\ \mathcal{N}(\mathbf{a}|\mathbf{b}, \mathbf{A}+ \mathbf{B}) \end{gather*}The multivariate normal distribution $\mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})$ can be represented as a distribution from the exponential family:
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})= (2\pi)^{-\frac{D}{2}}|\mathbf{A}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})\right]=\\ (2\pi)^{-\frac{D}{2}} \exp\left[-\frac{1}{2}\text{tr}(\mathbf{A}^{-1}\mathbf{x}\mathbf{x}^T) + \mathbf{a}^T\mathbf{A}^{-1}\mathbf{x}- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}| \right], \end{gather*}where we used that $\mathbf{a}^T\mathbf{A}^{-1}\mathbf{x} = \mathbf{x}^T\mathbf{A}^{-1}\mathbf{a}$, and also write the first term as
\[ \mathbf{x}^T\mathbf{A}^{-1}\mathbf{x}= \sum_{i,j}x_i (\mathbf{A}^{-1})_{ij} x_j= \sum_{i,j}(\mathbf{A}^{-1})_{ij} x_j x_i= \sum_{i,j}(\mathbf{A}^{-1})_{ij} (\mathbf{x}\mathbf{x}^T)_{ji}= \text{tr}(\mathbf{A}^{-1}\mathbf{x}\mathbf{x}^T) \]
Representing $\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B})$ in a similar way and multiplying the two distributions we readily obtain
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B})=\\ (2\pi)^{-D} \exp\left\{-\frac{1}{2}\text{tr}\left[(\mathbf{A}^{-1}+ \mathbf{B}^{-1})\mathbf{x}\mathbf{x}^T\right]+ (\mathbf{a}^T\mathbf{A}^{-1}+\mathbf{b}^T\mathbf{B}^{-1})\mathbf{x}- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}|- \frac{1}{2}\mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \frac{1}{2}\log|\mathbf{B}| \right\}=\\ c\mathcal{N}(\mathbf{x}|\mathbf{c},\mathbf{C}), \end{gather*}where we defined
\begin{gather*} \mathbf{C}^{-1} = \mathbf{A}^{-1}+ \mathbf{B}^{-1},\\ \mathbf{c}^T\mathbf{C}^{-1} = \mathbf{a}^T\mathbf{A}^{-1}+\mathbf{b}^T\mathbf{B}^{-1},\\ c = (2\pi)^{-\frac{D}{2}} \exp\left\{\frac{1}{2}\mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} + \frac{1}{2}\log|\mathbf{C}|- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}|- \frac{1}{2}\mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \frac{1}{2}\log|\mathbf{B}| \right\} \end{gather*}Coefficient $c$ can now be reduced to the required form using the matrix transformations described in part a).
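Before moving on, here is a numerical sketch (NumPy/SciPy) of the identity derived above, $\mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B}) = c\,\mathcal{N}(\mathbf{x}|\mathbf{c},\mathbf{C})$ with $c=\mathcal{N}(\mathbf{a}|\mathbf{b},\mathbf{A}+\mathbf{B})$, checked at a few random points:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
D = 3

def random_spd(d):
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)      # a symmetric positive definite covariance

a, b = rng.normal(size=D), rng.normal(size=D)
A, B = random_spd(D), random_spd(D)

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c_vec = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)
c = mvn.pdf(a, mean=b, cov=A + B)       # the claimed scaling constant

for x in rng.normal(size=(5, D)):
    lhs = mvn.pdf(x, mean=a, cov=A) * mvn.pdf(x, mean=b, cov=B)
    rhs = c * mvn.pdf(x, mean=c_vec, cov=C)
    assert np.isclose(lhs, rhs)
print("product-of-Gaussians identity verified at sampled points")
```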
The expectation value and the conditional expectation value are given by
\[ \mathbb{E}_X[x] = \int x p(x) dx,\\ \mathbb{E}_Y[f(y)] = \int f(y) p(y) dy,\\ \mathbb{E}_X[x|y] = \int x p(x|y) dx \]
We then have
\begin{gather*} \mathbb{E}_Y\left[\mathbb{E}_X[x|y]\right] = \int \mathbb{E}_X[x|y] p(y) dy\\ = \int \left[\int xp(x|y)dx\right]p(y) dy\\ = \int \int xp(x|y)p(y)dx dy\\ = \int\int xp(x,y)dxdy\\ = \int x\left[\int p(x,y) dy\right] dx\\ = \int x p(x) dx\\ = \mathbb{E}_X[x], \end{gather*}where we used the definition of the conditional probability density
\[ p(x|y)p(y) = p(x,y) \]
If $\mathbf{x}$ is fixed, then $\mathbf{y}$ has the same distribution as $\mathbf{w}$, but with the mean shifted by $\mathbf{A}\mathbf{x} + \mathbf{b}$, that is
\[ p(\mathbf{y}|\mathbf{x}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{Q}) \]
Let us consider the random variable $\mathbf{u} = \mathbf{A}\mathbf{x}$; it is distributed according to
\[ p(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\mathbf{A}\mathbf{\mu}_x, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T). \]
Then $\mathbf{y}$ is a sum of the two independent Gaussian random variables $\mathbf{u}$ and $\mathbf{w}$, with its mean additionally shifted by $\mathbf{b}$, that is
\[ p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{\mu}_x + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q}), \]
that is
\[ \mathbf{\mu}_y = \mathbf{A}\mathbf{\mu}_x + \mathbf{b},\\ \mathbf{\Sigma}_y = \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q}. \]
As in b), assuming that $\mathbf{y}$ is fixed, we obtain the conditional distribution
\[ p(\mathbf{z}|\mathbf{y}) = \mathcal{N}(\mathbf{z}|\mathbf{C}\mathbf{y}, \mathbf{R}) \]
Since $\mathbf{C}\mathbf{y}$ is a Gaussian random variable with distribution $\mathcal{N}(\mathbf{C}\mathbf{\mu}_y, \mathbf{C}\mathbf{\Sigma}_y\mathbf{C}^T)$, we obtain the distribution of $\mathbf{z}$ as that of a sum of two Gaussian random variables:
\[ p(\mathbf{z})= \mathcal{N}(\mathbf{z} |\mathbf{C}\mathbf{\mu}_y, \mathbf{C}\mathbf{\Sigma}_y\mathbf{C}^T + \mathbf{R})= \mathcal{N}(\mathbf{z} |\mathbf{C}(\mathbf{A}\mathbf{\mu}_x + \mathbf{b}), \mathbf{C}(\mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q})\mathbf{C}^T + \mathbf{R}) \]
The posterior distribution $p(\mathbf{x}|\mathbf{y})$ can be obtained by applying Bayes' theorem:
\[ p(\mathbf{x}|\mathbf{y})= \frac{p(\mathbf{y}|\mathbf{x})p(\mathbf{x})}{p(\mathbf{y})}= \frac{\mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{Q})\mathcal{N}(\mathbf{x}|\mathbf{\mu}_x,\mathbf{\Sigma}_x)} {\mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{\mu}_x + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q})} \]
The cdf is related to the pdf by
\[ F_x(x) = \int_{-\infty}^xdx' f_x(x'),\\ \frac{d}{dx} F_x(x) = f_x(x) \]
and takes values in the interval $[0,1]$.
The pdf of the variable $y=F_x(x)$ can then be obtained from the change-of-variables formula:
\[ f_y(y) = f_x(x) \left|\frac{dx}{dy}\right| = \frac{f_x(x)}{\left|\frac{dy}{dx}\right|} = \frac{f_x(x)}{\left|\frac{dF_x(x)}{dx}\right|} = \frac{f_x(x)}{f_x(x)} = 1, \]
i.e. $y$ is uniformly distributed on the interval $[0,1]$.
Continuous Optimization
The first and second derivatives of the function $f(x)$ are given by
\[ f'(x) = 3x^2 + 12 x -3,\\ f''(x) = 6x + 12 \]
To find the candidate maxima and minima we find the points where the first derivative vanishes:
\[ 3x^2 + 12 x -3 = 0 \Rightarrow x_{1,2} = -2 \pm \sqrt{5} \]
Substituting each of the roots into the second derivative we obtain that
\[ f''(x_1) = f''(-2 + \sqrt{5}) > 0,\\ f''(x_2) = f''(-2 - \sqrt{5}) < 0, \]
i.e. $x_1 = -2 + \sqrt{5}$ is a local minimum, whereas $x_2 = -2 - \sqrt{5}$ is a local maximum of the function $f(x)$.
The full batch update rule for gradient descent is
\[ \mathbf{\theta}_{i+1} = \mathbf{\theta}_i - \gamma_i\sum_{n=1}^{N}\left(\nabla L_n(\mathbf{\theta}_i)\right)^T \]
For a mini-batch of size $1$, at every step we use only one data point, i.e. we randomly choose an index $n$ from $\{1,\dots,N\}$ and calculate the update as
\[ \mathbf{\theta}_{i+1} = \mathbf{\theta}_i - \gamma_i\left(\nabla L_n(\mathbf{\theta}_i)\right)^T \]
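A minimal sketch of the two update rules, using a hypothetical least-squares loss $L_n(\mathbf{\theta}) = \frac12(\mathbf{x}_n^{\mathsf{T}}\mathbf{\theta} - y_n)^2$ (the data, loss, and step size are invented for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=N)

def grad_Ln(theta, n):
    # gradient of the single-example loss L_n(theta) = 0.5 * (x_n^T theta - y_n)^2
    return (X[n] @ theta - y[n]) * X[n]

gamma = 0.005
theta_batch = np.zeros(D)
theta_sgd = np.zeros(D)
for i in range(2000):
    # full-batch update: sum the gradients over all N data points
    theta_batch -= gamma * sum(grad_Ln(theta_batch, n) for n in range(N))
    # mini-batch of size 1: one randomly chosen data point per step
    n = rng.integers(N)
    theta_sgd -= gamma * grad_Ln(theta_sgd, n)

print(theta_batch)   # ≈ theta_true
print(theta_sgd)     # ≈ theta_true, up to stochastic-gradient noise
```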
True. Indeed, the line segment connecting any two points in the intersection of the two convex sets belongs to each of the sets (since they are convex), and therefore also belongs to the intersection.
False. For example, if the two sets are disjoint, their union need not contain the line segment joining a point of one set to a point of the other.
False. Subtracting one set from the other may remove points that lie on line segments connecting points of the remaining set.
True. If $f(\mathbf{x})$ and $g(\mathbf{x})$ are convex functions, it means that for any $\theta \in[0,1]$:
\[ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y}),\\ g(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta g(\mathbf{x}) + (1-\theta)g(\mathbf{y}). \]
By adding these inequalities we immediately obtain the condition of convexity for their sum, $h(\mathbf{x}) = f(\mathbf{x}) + g(\mathbf{x})$:
\begin{gather*} h(\theta \mathbf{x} + (1-\theta)\mathbf{y})=\\ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) + g(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y}) + \theta g(\mathbf{x}) + (1-\theta)g(\mathbf{y})=\\ \theta h(\mathbf{x}) + (1-\theta)h(\mathbf{y}) \end{gather*}False. For example, the difference of $f(x) = ax^2$ and $g(x) = bx^2$ (with $a, b > 0$) is convex if $a \geq b$, but concave otherwise.
False. Consider the product of the two convex functions $f(x) = (x-a)^2$ and $g(x) = (x+a)^2$: the function $h(x) = (x-a)^2(x+a)^2$ has two minima and a local maximum at $x=0$ (for $a \neq 0$). Indeed, if we take $x=a$, $y=-a$ and $\theta = 0.5$, then
\[ h(\theta x + (1-\theta)y) = h(0) = a^4 > 0 = 0.5 h(a) + 0.5 h(-a) = \theta h(x) + (1-\theta)h(y), \]
i.e. the condition of convexity is not satisfied.
The maximum of a function is a value rather than a function, so the notion of convexity of functions does not apply to it. If, however, we regard the maximum as a set consisting of a single point, then it is trivially convex.
We first introduce vectors $\mathbf{y} = (x_0, x_1, \xi)^T$ and $\mathbf{c} = (p_0, p_1, 1)^T$, so that $\mathbf{p}^T\mathbf{x} + \xi = \mathbf{c}^T\mathbf{y}$.
We then introduce the vector $\mathbf{b} = (0, 3, 0)^T$ and the matrix
\[ \mathbf{A} =\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & -1 \end{bmatrix}\]
which allow us to write the three constraints as $\mathbf{A}\mathbf{y} \leq \mathbf{b}$.
We now can write the problem as a standard linear program:
\begin{gather*} \max_{\mathbf{y} \in \mathbb{R}^3} \mathbf{c}^T\mathbf{y}\\ \text{subject to } \mathbf{A}\mathbf{y} \leq \mathbf{b} \end{gather*}
Let us define $\mathbf{c} = (-5, -3)^T$, $\mathbf{b} = (33, 8, 5, -1, 8)^T$ and
\[ \mathbf{A}=\begin{bmatrix} 2 & 2\\ 2 & -4\\-2 & 1\\ 0 & -1\\ 0 & 1 \end{bmatrix}\]
The linear program is then written as
\begin{gather*} \min_{\mathbf{x} \in \mathbb{R}^2} \mathbf{c}^T\mathbf{x}\\ \text{subject to } \mathbf{A}\mathbf{x} \leq \mathbf{b} \end{gather*}The Lagrangian of this problem is
\[ \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = \mathbf{c}^T\mathbf{x} + \mathbf{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b})= (\mathbf{c}^T + \mathbf{\lambda}^T\mathbf{A})\mathbf{x} - \mathbf{\lambda}^T\mathbf{b}= (\mathbf{c} + \mathbf{A}^T\mathbf{\lambda})^T\mathbf{x} - \mathbf{\lambda}^T\mathbf{b} \]
Since the Lagrangian is linear in $\mathbf{x}$, its minimum over $\mathbf{x}$ is finite only if the coefficient of $\mathbf{x}$ vanishes, i.e.
\[ \mathbf{c} + \mathbf{A}^T\mathbf{\lambda} = 0, \]
in which case
\[ \mathcal{D}(\mathbf{\lambda}) = \min_{\mathbf{x} \in \mathbb{R}^2} \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = - \mathbf{\lambda}^T\mathbf{b} \]
so that the dual problem is given by
\begin{gather*} \max_{\mathbf{\lambda} \in \mathbb{R}^5} - \mathbf{b}^T\mathbf{\lambda}\\ \text{subject to } \mathbf{c} + \mathbf{A}^T\mathbf{\lambda} = 0 \text{ and } \mathbf{\lambda} \geq 0 \end{gather*}In terms of the original values of the parameters it can thus be written as
\[\max_{\mathbf{\lambda} \in \mathbb{R}^5} - \begin{bmatrix}33 \\ 8 \\ 5 \\ -1 \\ 8\end{bmatrix}^T\begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5\end{bmatrix}\\ \text{subject to } -\begin{bmatrix}5 \\ 3\end{bmatrix}+\begin{bmatrix}2 & 2 & -2 & 0 & 0\\2 & -4 & 1 & -1 & 1\end{bmatrix}\begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5 \end{bmatrix} = 0\\\text{and } \begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5\end{bmatrix} \geq 0\]
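A hedged numerical check of this primal/dual pair, using scipy.optimize.linprog (which solves minimization problems, so the dual maximum is recovered as minus the minimum of $\mathbf{b}^T\mathbf{\lambda}$):
\begin{verbatim}
# Solve the primal LP and its dual numerically and compare the optimal values.
import numpy as np
from scipy.optimize import linprog

c = np.array([-5.0, -3.0])
b = np.array([33.0, 8.0, 5.0, -1.0, 8.0])
A = np.array([[2, 2], [2, -4], [-2, 1], [0, -1], [0, 1]], dtype=float)

# Primal: min c^T x  subject to  A x <= b,  x unbounded
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual:  max -b^T lam  subject to  c + A^T lam = 0,  lam >= 0
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 5)

print(primal.fun, -dual.fun)  # the two values should coincide (strong duality)
\end{verbatim}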
We introduce $\mathbf{Q} = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}$, $\mathbf{c} = \begin{bmatrix} 5\\3\end{bmatrix}$, \[\mathbf{A} = \begin{bmatrix} 1 & 0\\-1 & 0\\ 0 & 1\\ 0 & -1 \end{bmatrix}\] and \[\mathbf{b}=\begin{bmatrix}1\\1\\1\\1\end{bmatrix}\]
Then the quadratic program takes the form
\begin{gather*} \min_{\mathbf{x} \in \mathbb{R}^2} \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x}\\ \text{subject to } \mathbf{A}\mathbf{x} \leq \mathbf{b} \end{gather*}The Lagrangian corresponding to this problem is
\[ \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x} + \mathbf{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b})= \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + (\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{x} - \mathbf{\lambda}^T\mathbf{b} \]
where $\mathbf{\lambda} = \begin{bmatrix}\lambda_1\\\lambda_2\\\lambda_3\\\lambda_4\end{bmatrix}$. We minimize the Lagrangian over $\mathbf{x}$ by setting its gradient to zero, which results in
\[ \mathbf{Q}\mathbf{x} + \mathbf{c}+\mathbf{A}^T\mathbf{\lambda} = 0 \Rightarrow \mathbf{x} = -\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda}) \]
Substituting this back into the Lagrangian we obtain
\[ \mathcal{D}(\mathbf{\lambda})= \min_{\mathbf{x} \in \mathbb{R}^2} \mathcal{L}(\mathbf{x},\mathbf{\lambda}) =-\frac{1}{2}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})- \mathbf{\lambda}^T\mathbf{b} \]
The dual problem is now
\[ \max_{\mathbf{\lambda} \in \mathbb{R}^4} -\frac{1}{2}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})- \mathbf{\lambda}^T\mathbf{b}\\ \text{subject to } \mathbf{\lambda} \geq 0, \] where the parameter vectors and matrices are defined above.
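A hedged numerical check of this quadratic program and of the dual above, using scipy.optimize.minimize; the box constraints $\mathbf{A}\mathbf{x} \leq \mathbf{b}$ here are simply $-1 \leq x_i \leq 1$:
\begin{verbatim}
# Solve the primal QP and maximize the dual numerically; the optimal values
# should coincide (convex QP with a strictly feasible point, e.g. x = 0).
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 1.0], [1.0, 4.0]])
c = np.array([5.0, 3.0])
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
b = np.ones(4)
Qinv = np.linalg.inv(Q)

primal = minimize(lambda x: 0.5 * x @ Q @ x + c @ x,
                  x0=np.zeros(2), bounds=[(-1, 1), (-1, 1)])

def neg_dual(lam):                     # -D(lam), to be minimized over lam >= 0
    v = c + A.T @ lam
    return 0.5 * v @ Qinv @ v + b @ lam

dual = minimize(neg_dual, x0=np.zeros(4), bounds=[(0, None)] * 4)

print(primal.fun, -dual.fun)           # the two values should coincide
\end{verbatim}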
The primal problem can be written in the standard form as
\begin{gather*} \min_{\mathbf{w}\in\mathbb{R}^D} \frac{1}{2}\mathbf{w}^T\mathbf{w}\\ \text{subject to } 1-\mathbf{x}^T\mathbf{w} \leq 0 \end{gather*}The Lagrangian is then
\[ \mathcal{L}(\mathbf{w},\lambda)= \frac{1}{2}\mathbf{w}^T\mathbf{w} + \lambda(1-\mathbf{x}^T\mathbf{w}) \]
Taking the gradient with respect to $\mathbf{w}$ and setting it to zero, we obtain the position of the minimum: \[ \mathbf{w} = \lambda \mathbf{x} \]
Thus, the dual Lagrangian is
\[\mathcal{D}(\lambda)= \min_{\mathbf{w}\in\mathbb{R}^D}\mathcal{L}(\mathbf{w},\lambda)=-\frac{\lambda^2}{2}\mathbf{x}^T\mathbf{x} +\lambda \]
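For completeness (this step is not part of the derivation above, and assumes $\mathbf{x} \neq \mathbf{0}$), maximizing the concave dual over $\lambda \geq 0$ gives
\[ \frac{d\mathcal{D}}{d\lambda} = 1 - \lambda\,\mathbf{x}^T\mathbf{x} = 0 \Rightarrow \lambda^* = \frac{1}{\mathbf{x}^T\mathbf{x}}, \qquad \mathcal{D}(\lambda^*) = \frac{1}{2\,\mathbf{x}^T\mathbf{x}}, \qquad \mathbf{w}^* = \lambda^*\mathbf{x} = \frac{\mathbf{x}}{\mathbf{x}^T\mathbf{x}}, \]
so the primal value $\frac{1}{2}\mathbf{w}^{*T}\mathbf{w}^* = \frac{1}{2\,\mathbf{x}^T\mathbf{x}}$ equals the dual value, i.e. there is no duality gap.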
Assuming the standard dot product, the convex conjugate is defined as
\[ f^*(\mathbf{s}) = \sup_{\mathbf{x}\in\mathbb{R}^D} \mathbf{s}^T\mathbf{x} - f(\mathbf{x}) \]
Since the function $f(\mathbf{x})$ is continuous, differentiable and convex, finding the supremum is equivalent to finding the maximum, which can be done by setting the following gradient to zero:
\[ \nabla_{\mathbf{x}}\left(\mathbf{s}^T\mathbf{x} - f(\mathbf{x})\right) = \mathbf{s}^T - \nabla_\mathbf{x}f(\mathbf{x})=0 \]
that is \[ s_d = \frac{\partial}{\partial x_d}f(\mathbf{x}) = \log x_d + 1 \Rightarrow x_d = e^{s_d - 1} \]
Then
\[ f^*(\mathbf{s})= \sum_{d=1}^D s_d x_d - \sum_{d=1}^D x_d\log x_d= \sum_{d=1}^D s_d e^{s_d-1} - \sum_{d=1}^D e^{s_d-1}(s_d - 1)= \sum_{d=1}^D e^{s_d-1} \]
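A one-dimensional numerical sanity check of this conjugate: the supremum of $sx - x\log x$ over $x > 0$ should equal $e^{s-1}$.
\begin{verbatim}
# Compare the numerically computed conjugate of x*log(x) with exp(s - 1).
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate(s):
    res = minimize_scalar(lambda x: -(s * x - x * np.log(x)),
                          bounds=(1e-9, 1e3), method="bounded")
    return -res.fun

for s in (-1.0, 0.0, 0.5, 2.0):
    print(s, conjugate(s), np.exp(s - 1))   # the last two columns should match
\end{verbatim}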
Without loss of generality we can assume that $\mathbf{A} = \mathbf{A}^T$ is symmetric (only the symmetric part of $\mathbf{A}$ contributes to $\mathbf{x}^T\mathbf{A}\mathbf{x}$), and then its inverse is symmetric as well. Setting the gradient to zero,
\[ \nabla_{\mathbf{x}}\left(\mathbf{s}^T\mathbf{x} - f(\mathbf{x})\right)= \mathbf{s}^T - \mathbf{x}^T\mathbf{A} - \mathbf{b}^T= 0, \]
that is
$\mathbf{s} = \mathbf{A}\mathbf{x} + \mathbf{b} \Leftrightarrow \mathbf{x} = \mathbf{A}^{-1}(\mathbf{s} - \mathbf{b})$
Therefore
\[ f^*(\mathbf{s}) = \frac{1}{2}(\mathbf{s} - \mathbf{b})^T\mathbf{A}^{-1}(\mathbf{s} - \mathbf{b}) - c \]
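A hedged numerical check of this result, assuming (as needed for the supremum to be attained) that $\mathbf{A}$ is symmetric positive definite; the matrices below are random illustrative values:
\begin{verbatim}
# Compare sup_x [ s^T x - f(x) ] with 0.5*(s-b)^T A^{-1} (s-b) - c.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)     # symmetric positive definite
b = rng.standard_normal(3)
c = 0.7
s = rng.standard_normal(3)

f = lambda x: 0.5 * x @ A @ x + b @ x + c
sup = -minimize(lambda x: -(s @ x - f(x)), x0=np.zeros(3)).fun
closed_form = 0.5 * (s - b) @ np.linalg.inv(A) @ (s - b) - c
print(sup, closed_form)         # should agree up to solver tolerance
\end{verbatim}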
By the definition of the convex conjugate:
\[L^*(\beta) = \sup_{\alpha \in \mathbb{R}} \beta\alpha - \max\{0, 1-\alpha\}=\sup_{\alpha \in \mathbb{R}}\begin{cases}\beta\alpha, \text{ if } \alpha > 1\\\beta\alpha -1 + \alpha, \text{ if } \alpha< 1 \end{cases}=\sup_{\alpha \in \mathbb{R}}\begin{cases}\beta\alpha, \text{ if } \alpha > 1\\(\beta+1)\alpha -1, \text{ if } \alpha< 1\end{cases}\]
We must here distinguish three cases:
- if $\beta>0$, then $\beta\alpha$ and $(\beta + 1)\alpha -1$ are both increasing with $\alpha$, and take their maximum values at the maximum possible value of $\alpha$, that is \[L^*(\beta) = \sup_{\alpha \in \mathbb{R}}\begin{cases}+\infty, \text{ if } \alpha > 1\\\beta, \text{ if } \alpha< 1\end{cases}=+\infty\]
- if $\beta < -1$, then $\beta\alpha$ and $(\beta + 1)\alpha -1$ are both decreasing with $\alpha$, and take their maximum values at the minimum possible value of $\alpha$, that is \[L^*(\beta) = \sup_{\alpha \in \mathbb{R}}\begin{cases}\beta, \text{ if } \alpha > 1\\+\infty, \text{ if } \alpha< 1\end{cases}=+\infty\]
- Finally, if $-1 \leq \beta \leq 0$, then $\beta\alpha$ is decreasing with $\alpha$, whereas $(\beta + 1)\alpha -1$ is growing, i.e.\[L^*(\beta) = \sup_{\alpha \in \mathbb{R}}\begin{cases}\beta, \text{ if } \alpha > 1\\\beta, \text{ if } \alpha< 1\end{cases}=\beta\]
We thus obtain
\[L^*(\beta) =\begin{cases} \beta \text{ if } -1\leq \beta \leq 0\\+\infty \text{ otherwise} \end{cases}\]
Let us now define the function $M(\beta) = L^*(\beta) + \frac{\gamma}{2}\beta^2$ and calculate its convex conjugate (we assume that $\gamma> 0$):
\[ M^*(\alpha)= \sup_{\beta \in \mathbb{R}} \alpha\beta - L^*(\beta) - \frac{\gamma}{2}\beta^2= \sup_{\beta \in [-1, 0]} \alpha\beta - L^*(\beta) - \frac{\gamma}{2}\beta^2= \sup_{\beta \in [-1, 0]} \alpha\beta - \beta - \frac{\gamma}{2}\beta^2, \]
where the second equality holds because outside the interval $[-1,0]$ we have $-L^*(\beta) = -\infty$.
In the interval $[-1,0]$ the parabola $\alpha\beta - \beta - \frac{\gamma}{2}\beta^2 = - \frac{\gamma}{2}\left(\beta-\frac{\alpha-1}{\gamma}\right)^2 +\frac{(1-\alpha)^2}{2\gamma}$ has a maximum at $\beta = \frac{\alpha-1}{\gamma}$ if this point lies inside the interval, and at one of the interval endpoints otherwise. More precisely,
\[M^*(\alpha)=\begin{cases} 0, \text{ if } \alpha > 1\\ \frac{(1-\alpha)^2}{2\gamma}, \text{ if } 1-\gamma\leq\alpha\leq 1\\ 1-\alpha-\frac{\gamma}{2}, \text{ if } \alpha < 1-\gamma \end{cases}\]
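A numerical sanity check of this piecewise expression (using the fact that the supremum is over $\beta \in [-1,0]$), with an illustrative value $\gamma = 0.5$:
\begin{verbatim}
# Compare the numerically computed M*(alpha) with the piecewise formula above.
import numpy as np
from scipy.optimize import minimize_scalar

gamma = 0.5

def M_star_numeric(alpha):
    res = minimize_scalar(lambda b: -((alpha - 1) * b - 0.5 * gamma * b**2),
                          bounds=(-1.0, 0.0), method="bounded")
    return -res.fun

def M_star_closed(alpha):
    if alpha > 1:
        return 0.0
    if alpha >= 1 - gamma:
        return (1 - alpha) ** 2 / (2 * gamma)
    return 1 - alpha - gamma / 2

for alpha in (-1.0, 0.2, 0.8, 1.5):
    print(alpha, M_star_numeric(alpha), M_star_closed(alpha))  # columns agree
\end{verbatim}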