Mathematics for Machine Learning
Linear Algebra
We have to check the five properties listed in Definition 2.7. Firstly, let's look at closure. Our operation should take two elements of $\mathbb{R}\setminus\{-1\}$ and return another element of $\mathbb{R}\setminus\{-1\}$. The result is certainly a real number, so suppose, for a contradiction, that it can equal $-1$. Let $a,b \in \mathbb{R}\setminus\{-1\}$ be such that $-1 = ab + a+ b$. Then $ab+a+b+1 = 0$, so $(a+1)(b+1) = 0$. Therefore, either $a=-1$ or $b=-1$, which is a contradiction.
Now, we attack associativity. We use the fact that $+$ is associative and commutative on $\mathbb{R}$. We have
\begin{aligned} (a\star b) \star c & = (ab+a+b) \star c \\ & = (ab+a+b)c + (ab+a+b) + c \\ & = abc + ac + bc + ab + a + b + c \\ & = abc + ab + ac + a + bc + b + c \\ & = a(bc+b+c) + a + (bc+b+c) \\ & = a(b\star c)+a + (b\star c) \\ & = a\star (b\star c). \end{aligned}
Thirdly, observe that $a\star 0 = a \cdot 0 +a+0 = a$, and similarly $0\star a = a$, so $0$ is the neutral element.
Fourthly, we take $x \star y = 0$ (the neutral element we found in the previous step). Solving for $y$ in terms of $x$, we find that $ y = \frac{-x}{x+1}, $ which always exists, since $x\neq -1$. Thus such a $y$ is the inverse element for $x \in \mathbb{R}\setminus\{-1\}$. This concludes the argument that $(\mathbb{R}\setminus\{-1\},\star)$ is a group.
Finally, if we observe that $a\star b = b \star a$, by the commutativity of multiplication and addition of real numbers, then we can conclude that $(\mathbb{R}\setminus\{-1\},\star)$ is an \emph{abelian} group.
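As a quick sanity check (not part of the formal proof), here is a small Python sketch that spot-checks the group axioms for $\star$ at random points; the helper name `star` is ours.

```python
# A numerical spot-check of the group axioms for (R \ {-1}, *); a sketch only,
# it samples random elements rather than proving anything.
import random

def star(a, b):
    return a * b + a + b

random.seed(0)
for _ in range(1000):
    a, b, c = (random.uniform(-10, 10) for _ in range(3))
    # closure rests on the factorisation (a * b) + 1 = (a + 1)(b + 1)
    assert abs((star(a, b) + 1) - (a + 1) * (b + 1)) < 1e-6
    # associativity and commutativity
    assert abs(star(star(a, b), c) - star(a, star(b, c))) < 1e-6
    assert abs(star(a, b) - star(b, a)) < 1e-9
    # neutral element 0 and inverse -a/(a+1)
    assert abs(star(a, 0) - a) < 1e-9
    if abs(a + 1) > 1e-6:
        assert abs(star(a, -a / (a + 1))) < 1e-6
print("all sampled checks passed")
```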
The major thing we need to check here is whether the operation $\oplus$ is well-defined. That is to say, if we take two different representatives of the same congruence classes, do we definitely get the same answer at the end? Let's take $a_1, a_2 \in \mathbb{Z} $ such that $\overline{a_1} = \overline{a_2}$, and similarly let $b_1, b_2 \in \mathbb{Z} $ such that $\overline{b_1} = \overline{b_2}$. We need to show that $\overline{a_1}\oplus \overline{b_1} = \overline{a_2} \oplus \overline{b_2}$. In other words, we need to show that $\overline{a_1+b_1} = \overline{a_2+b_2}$.
By the definition of the congruence class, two classes $\overline{a_1}$ and $\overline{a_2}$ are equal in $\mathbb{Z}_n$ if and only if $a_1 - a_2$ is a multiple of $n$. Let's say $a_1 - a_2 = k_1 n$ and ${b_1 - b_2 = k_2 n}$, for some $k_1, k_2 \in \mathbb{Z}$. Observe then that $(a_1+b_1)-(a_2+b_2) = (k_1+k_2)n$, and so $\overline{a_1+b_1} = \overline{a_2+b_2}$ indeed!
From here, we use properties of addition in the integers to deduce that $(\mathbb{Z}_n,\oplus)$ is an abelian group.
$\overline{a}\oplus\overline{b} = \overline{a+b}\in \mathbb{Z}_n$
$(\overline{a}\oplus \overline{b}) \oplus \overline{c} = \overline{a+b} \oplus \overline{c} = \overline{(a+b)+c} = \overline{a+(b+c)} = \overline{a} \oplus \overline{b+c} = \overline{a}\oplus (\overline{b} \oplus \overline{c})$, where the middle equality follows from the associativity of $\mathbb{Z}$.
$\overline{0}$ is the neutral element, since $\overline{0} \oplus \overline{a} = \overline{0+a} = \overline{a}$, and similarly the other way round.
Given $\overline{a} \in \mathbb{Z}_n$, we have that $\overline{-a} = \overline {n-a}$ is the inverse element, since $\overline{a} \oplus \overline{-a} = \overline{a-a} = \overline{0}$, indeed.
Finally, $\overline{a}\oplus \overline{b} = \overline{a+b} = \overline{b+a} = \overline{b} \oplus \overline{a}$, so the group is abelian!
If $n$ is prime, then every integer from $1$ to $n-1$ will be relatively prime to $n$. Let $a$ be such an integer. Then, by B\'ezout's theorem, there exist integers $u, v$ such that $au+nv=1$. If we take this identity mod $n$, we have $\overline{au} \oplus \overline{nv} = \overline{1}$. But $nv$ is a multiple of $n$, so $\overline{nv} = \overline{0}$, and this simplifies to $\overline{a} \otimes \overline{u} = \overline{1}$ – in other words, $\overline{u}$ is the inverse of $\overline{a}$. Also, if we take two integers $a$ and $b$, both of which are relatively prime to $n$, then the product $ab$ also cannot contain a factor of $n$, since $n$ is prime, so the operation is closed. The other properties follow immediately. This shows that we have a group. $\lozenge$
If $n$ is not prime, then we can say $n=ab$, where $a,b\in \mathbb{Z}$, with $1<a\leq b<n$. But then $\overline{a}\otimes \overline{b} = \overline{ab} = \overline{n} = \overline{0}\notin \mathbb{Z}_n\setminus\{\overline{0}\}$, and so the operation is not closed.
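A small Python sketch of this dichotomy (using the built-in modular inverse `pow(a, -1, n)`, available from Python 3.8):

```python
# Sketch: for prime n every nonzero class has a multiplicative inverse;
# for composite n the operation on Z_n \ {0} is not even closed.

def has_all_inverses(n):
    """Return True if every a in {1, ..., n-1} is invertible modulo n."""
    for a in range(1, n):
        try:
            u = pow(a, -1, n)          # raises ValueError when gcd(a, n) != 1
            assert (a * u) % n == 1
        except ValueError:
            return False
    return True

print(has_all_inverses(7))    # True: 7 is prime
print(has_all_inverses(12))   # False: e.g. 3 * 4 = 12 ≡ 0 (mod 12)
```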
Let $A_1 = \begin{bmatrix} 1 & x_1 & y_1 \\ 0 & 1 & z_1 \\ 0 & 0 & 1 \end{bmatrix}$ and $A_2 = \begin{bmatrix} 1 & x_2 & y_2 \\ 0 & 1 & z_2 \\ 0 & 0 & 1 \end{bmatrix}\in \mathcal{G}$.
Then $A_1 A_2 =\begin{bmatrix} 1 & x_1+x_2 & y_1+x_1z_2+y_2 \\ 0 & 1 & z_1+z_2 \\ 0 & 0 & 1 \end{bmatrix}\in\mathcal{G}$, so we have closure.
Associativity follows from the associativity of standard matrix multiplication.
Letting $x=y=z=0$, observe that the identity is in $\mathcal{G}$.
Finally, if we take $x_2 = -x_1$, $z_2 = -z_1$, and $y_2 = -y_1-x_1z_2$, then observe that $A_1A_2 = I_3$, and thus inverses are of the required form! Therefore, $\mathcal{G} $ is a group.
The group is not abelian, e.g. take $x_1=z_2=1$ and everything else to be $0$. Then multiplying these matrices in the other order (i.e. $x_2=z_1=1$) gives a different answer.
Let's do some Gaussian elimination. We start with $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 2&5&-7&-5&-2\\ 2&-1&1&3&4\\ 5&2&-4&2&6 \end{array}\right]$$.
Taking $r_2-2r_1$, $r_3-2r_1$ and $r_4-5r_1$, we have $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&-3&3&5&2\\ 0&-3&1&7&1 \end{array}\right]$$.
Now taking $r_3+r_2$ and $r_4+r_2$, we have $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&0&-2&2&-2\\ 0&0&-4&4&-3 \end{array}\right]$$.
Finally, taking $r_4-2r_3$, we have $$\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&0&-2&2&-2\\ 0&0&0&0&1 \end{array}\right]$$.
Observe that the rank of the augmented matrix (4) is greater than the rank of the coefficient matrix (3), so the system has no solution.
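As a sketch (assuming NumPy), the rank argument can be confirmed numerically:

```python
import numpy as np

# coefficient matrix and right-hand side of the system above
A = np.array([[1, 1, -1, -1],
              [2, 5, -7, -5],
              [2, -1, 1, 3],
              [5, 2, -4, 2]], dtype=float)
b = np.array([1, -2, 4, 6], dtype=float)

print(np.linalg.matrix_rank(A))                        # 3
print(np.linalg.matrix_rank(np.column_stack([A, b])))  # 4 -> no solution
```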
We proceed with some more Gaussian elimination. From the start, we do $r_2-r_1$, $r_3-2r_1$, and $r_4+r_1$, to obtain $$\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&2&0&-3&-1&3\\ 0&1&0&1&-3&-1\\ 0&1&0&-2&0&2 \end{array}\right]$$.
From here, do $r_2-2r_3$ and $r_4-r_3$ to obtain $$\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&0&0&-5&5&5\\ 0&1&0&1&-3&-1\\ 0&0&0&-3&3&3 \end{array}\right]$$.
Next, if we divide $r_2$ by $-5$, and then do $r_4+3r_2$, we have $$\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&0&0&1&-1&-1\\ 0&1&0&1&-3&-1\\ 0&0&0&0&0&0 \end{array}\right]$$.
Now, we do $r_1+r_3$, then swap $r_2$ and $r_3$, to give $$\left[ \begin{array}{ccccc|c} 1&0&0&1&-2&2\\ 0&1&0&1&-3&-1\\ 0&0&0&1&-1&-1\\ 0&0&0&0&0&0 \end{array}\right]$$.
Finally, let's do $r_1-r_3$ and $r_2-r_3$, to obtain $$\left[ \begin{array}{ccccc|c} 1&0&0&0&-1&3\\ 0&1&0&0&-2&0\\ 0&0&0&1&-1&-1\\ 0&0&0&0&0&0 \end{array}\right]$$.
Let's turn these back into equations. We have $x_1-x_5=3$, $x_2-2x_5=0$ and $x_4-x_5=-1$. Thus we can take $x_3$ and $x_5$ to be arbitrary, and then the others are determined. This gives a solution set of $\{ (\alpha+3, 2\alpha, \beta, \alpha-1, \alpha)^{\mathsf{T}} : \alpha, \beta \in \mathbb{R} \}$.
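A quick NumPy sketch that plugs members of the claimed solution set into the system (we use the matrix from the first elimination step above, since row operations do not change the solution set):

```python
import numpy as np

M = np.array([[1, -1, 0, 0, 1, 3],
              [0, 2, 0, -3, -1, 3],
              [0, 1, 0, 1, -3, -1],
              [0, 1, 0, -2, 0, 2]], dtype=float)
A, b = M[:, :5], M[:, 5]

rng = np.random.default_rng(0)
for _ in range(5):
    alpha, beta = rng.normal(size=2)
    x = np.array([alpha + 3, 2 * alpha, beta, alpha - 1, alpha])
    assert np.allclose(A @ x, b)      # every sampled member solves the system
print("solution set verified at sampled points")
```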
We start with doing $r_3-r_1$, to give us $$\left[ \begin{array}{cccccc|c} 0&1&0&0&1&0&2\\ 0&0&0&1&1&0&-1\\ 0&0&0&0&-1&1&-1 \end{array}\right]$$.
Next, do $r_1+r_3$, $r_2+r_3$, and then multiply $r_3$ by $-1$. This gives $$\left[ \begin{array}{cccccc|c} 0&1&0&0&0&1&1\\ 0&0&0&1&0&1&-2\\ 0&0&0&0&1&-1&1 \end{array}\right]$$.
This corresponds to the equations $x_2+x_6=1$, $x_4+x_6=-2$, and $x_5-x_6 = 1$. Now we can take $x_1$, $x_3$ and $x_6$ to be arbitrary, giving a solution set of $\{ (\alpha, 1-\beta, \gamma, -2-\beta, 1+\beta, \beta)^{\mathsf{T}} : \alpha,\beta, \gamma \in \mathbb{R} \}$.
Here, we are looking for eigenvectors associated with the eigenvalue $12$. Indeed, we don't yet know that $12$ is an eigenvalue, but if we find ourselves unable to find such an eigenvector, then we can answer that the solution set will be empty.
We have that $A-12I = \begin{bmatrix}-6&4&3\\ 6&-12&9\\ 0&8&-12 \end{bmatrix}$, and observe that $(3,3,2)^{\mathsf{T}} \in \ker(A-12I)$, so this is an eigenvector (and $12$ is an eigenvalue indeed!). Finally, we want an eigenvector such that the sum of the components is $1$, so simply divide the eigenvector above by $3+3+2=8$.
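Here is a short NumPy check of this (a sketch, using only the matrix $A-12I$ written above):

```python
import numpy as np

A_minus_12I = np.array([[-6, 4, 3],
                        [6, -12, 9],
                        [0, 8, -12]], dtype=float)
v = np.array([3, 3, 2], dtype=float)

print(A_minus_12I @ v)   # [0. 0. 0.] -> v is in the kernel, so 12 is an eigenvalue
print(v / v.sum())       # the eigenvector rescaled so its components sum to 1
```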
We perform Gaussian elimination on $$\left[ \begin{array}{cccc|cccc} 1&&1&&1&&&\\ &1&1&&&1&&\\ 1&1&&1&&&1&\\ 1&1&1&&&&&1 \end{array}\right]$$, where a blank space denotes a 0.
Firstly, with $r_3-r_1$, $r_4-r_1$, $r_3-r_2$, and $r_4-r_2$, we have $$\left[ \begin{array}{cccc|cccc} 1&&1&&1&&&\\ &1&1&&&1&&\\ &&-2&1&-1&-1&1&\\ &&-1&&-1&-1&&1 \end{array}\right]$$.
Then with $r_1+r_4$, $r_2+r_4$, and $r_3-2r_4$, we have $$\left[ \begin{array}{cccc|cccc} 1&&&&&-1&&1\\ &1&&&-1&&&1\\ &&&1&1&1&1&-2\\ &&-1&&-1&-1&&1 \end{array}\right]$$.
Finally, swapping $r_3$ and $r_4$, then multiplying $r_3$ by $-1$, we have $$\left[ \begin{array}{cccc|cccc} 1&&&&&-1&&1\\ &1&&&-1&&&1\\ &&1&&1&1&&-1\\ &&&1&1&1&1&-2 \end{array}\right]$$.
The matrix to the right of the vertical line is the inverse of $A$.
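As a sketch, the result can be checked with NumPy:

```python
import numpy as np

A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
A_inv = np.array([[0, -1, 0, 1],
                  [-1, 0, 0, 1],
                  [1, 1, 0, -1],
                  [1, 1, 1, -2]], dtype=float)

assert np.allclose(A @ A_inv, np.eye(4))       # the computed matrix inverts A
assert np.allclose(A_inv, np.linalg.inv(A))    # and agrees with NumPy's inverse
```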
Let's form the equation $\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 = 0$. These three vectors are linearly independent if and only if the \emph{only} solution to this equation is $\alpha_1=\alpha_2=\alpha_3=0$.
Looking at the third component, we have that $\alpha_1\cdot 1 + \alpha_2 \cdot 0 + \alpha_3 \cdot 0 = 0$, that is to say, $\alpha_1 = 0$.
Next, look at the second component. We already know $\alpha_1 = 0$, so we have $\alpha_2 \cdot 1 + \alpha_3 \cdot 0 = 0$, that is to say $\alpha_2=0$ also.
Finally, look at the first component. We have that $\alpha_3 \cdot 1 = 0$, so all of the $\alpha_i$'s are zero. Therefore, our three vectors are linearly independent.
Here, we need to solve the simultaneous equations $\alpha_1+\alpha_2+2\alpha_3 = 1$, $\alpha_1+2\alpha_2-\alpha_3=-2$, and $\alpha_1+3\alpha_2+\alpha_3 = 5$.
Performing Gaussian elimination in the usual way, we determine that $\alpha_1 = -6$, $\alpha_2 = 3$, and $\alpha_3 = 2$. That is to say, $y=-6x_1+3x_2+2x_3$.
We write the given vectors as $v_1, \dots v_6$ from left to right. Firstly, observe that $\dim(U_1)=2$ and $\dim(U_2)=2$ (compute the rank of $\left[v_1|v_2|v_3\right]$, then $\left[v_4|v_5|v_6\right]$). Since we can write $v_3= \frac13 (v_1-2v_2)$ and $v_6=-v_4-2v_5$, we need not consider $v_3$ and $v_6$ any further.
Now, if we find the rank of $\left[v_1|v_2|v_4|v_5\right]$, we get 3, so $\dim(U_1+U_2)=3$. Therefore, $\dim(U_1\cap U_2) = 2+2-3=1$. Hence, to find a basis of $U_1 \cap U_2$, we need only find any non-zero vector in the space.
Let $0\neq v \in U_1\cap U_2$. Then we can write $v=\alpha_1v_1 + \alpha_2v_2$, and $v=\alpha_4v_4 + \alpha_5v_5$. Subtracting these equations, we have $0=\alpha_1v_1 + \alpha_2v_2 -\alpha_4v_4 - \alpha_5v_5$. Remember we want a non-zero solution for $v$, and observe that the rank of $\left[v_1|v_2|v_4\right]$ is 3 (i.e. these three vectors are linearly independent). Hence we can take $\alpha_5=9$, say (this means we don't have fractions later, but there's no way to know this a priori!), and solve for the other $\alpha_i$'s. Using Gaussian elimination, we obtain $\alpha_1=4$, $\alpha_2=10$ and $\alpha_4 = -6$. Thus \[ v=4v_1+10v_2 = -6v_4+9v_5 = \begin{bmatrix} 24\\-6\\-12\\-6 \end{bmatrix}. \]
Finally, we can write our basis of $U_1\cap U_2$ as just $\{ v \}$.
Note that $rank([A_1|A_2])=3$, i.e. $\dim(U_1+U_2)=3$ (here $[A_1|A_2]$ is just the $4\times 6$ matrix formed by concatenating $A_1$ and $A_2$). This means that $\dim (U_1\cap U_2) = \dim(U_1)+\dim(U_2)-\dim(U_1+U_2) = 2+2-3=1$, so again, to find a basis of $U_1\cap U_2$, we need only find a non-zero vector in the space. We proceed in a similar way to Question 2.12.
Firstly, we observe that we can write $v_3=v_1+v_2$ and $v_6=v_4+v_5$, so these two vectors can safely be ignored. Secondly, observe that $\left[v_1|v_2|v_5\right]$ has rank three, so (using the notation of Question 12) if we take $\alpha_4=1$, say, and solve then we have $\alpha_1=3$, $\alpha_2=1$, and $\alpha_5=0$. In other words, our non-zero vector is $3v_1+v_2 = v_4 = (3,1,7,3)^{\mathsf{T}}$, and our basis of $U_1\cap U_2$ is $\{ (3,1,7,3)^{\mathsf{T}} \}$.
Doing the same dimensional analysis as the previous three questions, we find that $F\cap G$ has dimension 1. We have $F=span[(1,0,1), (0,1,1)]$, and $G=span[ (1,1,1),(-1,1,-3)]$.
Proceeding in the same way as before, we find that $-4v_1-2v_2 = -3v_3+v_4$, and hence $\{(-4,-2,-6)\}$ is a basis of $F \cap G$, which agrees with Part b.
From the coefficients on the right, we have $A_\Phi = \begin{bmatrix} 3&2&1\\ 1&1&1\\ 1&-3&0\\ 2&3&1 \end{bmatrix}$.
Using Gaussian elimination, we can compute that $rank(A_\Phi) = 3$. From this we deduce that the kernel is trivial (i.e. only $(0,0,0)$), and clearly $Im(\Phi)= \{ (3x_1+2x_2+x_3,x_1+x_2+x_3,x_1-3x_2,2x_1+3x_2+x_3)^{\mathsf{T}} : x_1, x_2, x_3 \in \mathbb{R} \}$. We have $\dim(\ker(\Phi))=0$, and $\dim(Im(\Phi))=rank(A_\Phi)=3$.
Let $P$ be the change of basis matrix from the basis $B$ to the standard basis of $\mathbb{R}^3$, i.e. the matrix whose columns are the vectors of $B$ written in standard coordinates. Then $P=\begin{bmatrix} 1&1&1\\ 1&2&0\\ 1&1&0 \end{bmatrix}$.
The matrix $\overline{A_\Phi}$ is given by $P^{-1}A_\Phi P = \begin{bmatrix} 6&9&1\\-3&-5&0\\-1&-1&0 \end{bmatrix}$.
Let $M=[c_1|c_2|c_3]$, and observe that $\det M = 4 \neq 0$, so the vectors are linearly independent. Since $\mathbb{R}^3$ has dimension 3, and we have three linearly independent vectors, $C$ must indeed be a basis.
Indeed, such an $M$ is the change of basis matrix from $C$ to $C'$ (write the old vectors in terms of the new!) and this is thus the $P_2$ we require. Thus $P_2 = \begin{bmatrix} 1&0&1\\ 2&-1&0\\-1&2&-1 \end{bmatrix}$.
We first need to apply $P_1$ to change from basis $B'$ to $B$. Then $A_\Phi$ will map us to $(\mathbb{R}^3, C)$, before $P_2$ will take us to $C'$. Remember that matrices are acting like functions here, so they are applied to (column) vectors from right to left. Therefore the multiplication we require is $A'=P_2 A_\Phi P_1$. (This is what part f is asking us to recognise.)
We have $A' = \begin{bmatrix} 0&2\\-10&3\\ 12&-4 \end{bmatrix}$.
$P_1 \begin{bmatrix}2\\3\end{bmatrix}=\begin{bmatrix}8\\9\end{bmatrix}$.
$A_\Phi \begin{bmatrix}8\\9\end{bmatrix}=\begin{bmatrix}-1\\9\\7\end{bmatrix}$.
$P_2 \begin{bmatrix}-1\\9\\7\end{bmatrix}=\begin{bmatrix}6\\-11\\12\end{bmatrix}$.
And observe that $A' \begin{bmatrix}2\\3\end{bmatrix}=\begin{bmatrix}6\\-11\\12\end{bmatrix}$, indeed!
Analytic Geometry
We must show that $\langle\cdot,\cdot\rangle$ is a bilinear, symmetric, positive definite map.
It is a calculation to show that $\langle \alpha x + \beta x', y \rangle = \alpha \langle x,y\rangle + \beta \langle x',y\rangle$, and similarly for the second argument, and so $\langle\cdot,\cdot\rangle$ is bilinear.
We have $\langle y,x\rangle = y_1x_1 - (y_1x_2+y_2x_1)+2y_2x_2 = \langle x,y\rangle$ indeed, so we have symmetry.
Finally, $\langle x,x \rangle = x_1^2 - 2x_1x_2+2x_2^2 = (x_1-x_2)^2 + x_2^2$. Clearly, if $x=0$, then $\langle x,x \rangle = 0$. If $x\neq 0$, then either $x_2 \neq 0$, in which case $\langle x,x \rangle > 0$; or $x_2=0$ with $x_1 \neq 0$, so $\langle x,x \rangle > 0$ again (since it's the sum of two squares of real numbers). Therefore $\langle \cdot , \cdot \rangle$ is an inner product indeed!
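Equivalently, $\langle x,y\rangle = x^{\mathsf{T}} M y$ with $M = \begin{bmatrix}1&-1\\-1&2\end{bmatrix}$, so symmetry and positive definiteness can also be read off from $M$; a small NumPy sketch:

```python
import numpy as np

M = np.array([[1, -1],
              [-1, 2]], dtype=float)   # <x, y> = x^T M y for the form above

assert np.allclose(M, M.T)             # symmetric
print(np.linalg.eigvalsh(M))           # both eigenvalues positive -> positive definite
```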
Let $v_1, \dots, v_4$ be the four vectors defined in the question. Observe that $rank[v_1|v_2|v_3|v_4]=3$. Moreover, $rank[v_1|v_2|v_3]=3$, so these three vectors form a basis of $U$. Let $B$ be this matrix of basis vectors, i.e. $B=[v_1|v_2|v_3]$.
Now we compute $B^{\mathsf{T}}B = \begin{bmatrix} 9&9&0\\ 9&16&-14\\ 0&-14&31 \end{bmatrix}$ and $B^{\mathsf{T}}x = \begin{bmatrix} 9\\23\\-25 \end{bmatrix}$.
Next, we solve $B^{\mathsf{T}}B\lambda = B^{\mathsf{T}}x$ for $\lambda$. Using Gaussian elimination, we obtain $\lambda = \begin{bmatrix}-3\\ 4\\ 1 \end{bmatrix}$.
Finally, $\pi_U(x)$ is given by $B\lambda = \begin{bmatrix} 1\\-5\\-1\\-2\\3 \end{bmatrix}$.
By construction, $d(x,U) = d(x,\pi_U(x))$. This is given by $\lVert x-\pi_U(x) \rVert = \begin{Vmatrix}-2\\-4\\0\\6\\-2 \end{Vmatrix}$.
This norm is given by the square root of the dot product of the vector with itself, so $d(x,U) = \sqrt{60}=2\sqrt{15}$.
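A NumPy sketch re-solving the normal equations from the quantities computed above:

```python
import numpy as np

BtB = np.array([[9, 9, 0],
                [9, 16, -14],
                [0, -14, 31]], dtype=float)
Btx = np.array([9, 23, -25], dtype=float)

lam = np.linalg.solve(BtB, Btx)
print(lam)                                            # [-3.  4.  1.]

residual = np.array([-2, -4, 0, 6, -2], dtype=float)  # x - pi_U(x) from above
print(np.linalg.norm(residual), np.sqrt(60))          # both ≈ 7.746
```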
We require $\langle e_1, e_2-\pi_U(e_2)\rangle = 0$ and $\langle e_3, e_2-\pi_U(e_2)\rangle = 0$. That is to say, $\left\langle \begin{bmatrix} 1\\0\\0 \end{bmatrix}, \begin{bmatrix}-\lambda_1\\1\\ -\lambda_2 \end{bmatrix} \right\rangle = 0$, and $\left\langle \begin{bmatrix} 0\\0\\1 \end{bmatrix}, \begin{bmatrix}-\lambda_1\\1\\ -\lambda_2 \end{bmatrix} \right\rangle = 0$.
Computing the first gives us $\lambda_1=\frac12$, while the second gives $\lambda_2 = -\frac12$.
Therefore, $\pi_U(e_2) = \begin{bmatrix} \frac12 \\ 0\\ -\frac12 \end{bmatrix}$.
We have $Im(id-\pi) = \{(id-\pi)(x):x\in V\} = \{x-\pi(x):x\in V\}$. Observe that $\pi(x-\pi(x)) = \pi(x)-\pi^2(x) = 0$, since $\pi$ is a projection. Thus, $Im(id-\pi)\subseteq\ker\pi$.
Now, suppose $k\in \ker \pi$. Then $k-\pi(k)=k$, so $k\in Im(id-\pi)$. Thus $\ker\pi\subseteq Im(id-\pi)$. Therefore, we have that $Im(id-\pi) = \ker\pi$. $\lozenge$
We have that $\ker(id-\pi) = \{x\in V : (id-\pi)(x) = 0\} = \{x\in V : x=\pi(x)\}$. Clearly, $\ker(id-\pi)\subseteq Im\pi$.
Take $x\in Im\pi$. Then there exists some $y\in V$ such that $\pi(y)=x$. Observe that $(id-\pi)(x) = x-\pi(x) = \pi(y)-\pi^2(y) = 0$, since $\pi$ is a projection. Hence $Im\pi\subseteq\ker(id-\pi)$. Therefore we have that $\ker(id-\pi) = Im\pi$.
I will assume we use the standard dot product as our inner product. Firstly, let's get an orthogonal basis, then we can simply divide by the magnitude of each vector to get our orthonormal basis.
Since $b_1$ is our first vector, there is nothing to do for now. To get the second vector in our basis, we need a vector, $b_2'$, perpendicular to $b_1$, such that $span(b_1,b_2) = span (b_1, b_2')$. This is given by $b_2' = b_2 - \pi_{span(b_1)}(b_2)$.
Now, $\pi_{span(b_1)}(b_2) = \frac{b_1b_1^\mathsf{T}}{\lVert b_1 \rVert^2} b_2 = \frac{1}{3} \begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}\begin{bmatrix}-1\\2\\0\end{bmatrix} = \frac{1}{3}\begin{bmatrix}1\\1\\1\end{bmatrix}$.
Therefore, $b_2' = \begin{bmatrix}-\frac43 \\ \frac53 \\ -\frac13 \end{bmatrix}$.
Hence, our orthonormal basis, $C$, is given by $C=\left\{\frac{1}{\sqrt3}\begin{bmatrix} 1\\1\\1 \end{bmatrix} , \frac{1}{\sqrt{42}}\begin{bmatrix}-4\\5\\-1 \end{bmatrix}\right\}$.
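A short NumPy sketch of the same Gram-Schmidt computation (standard dot product):

```python
import numpy as np

b1 = np.array([1, 1, 1], dtype=float)
b2 = np.array([-1, 2, 0], dtype=float)

c1 = b1 / np.linalg.norm(b1)
b2_orth = b2 - (b1 @ b2) / (b1 @ b1) * b1   # remove the component along b1
c2 = b2_orth / np.linalg.norm(b2_orth)

print(c1 * np.sqrt(3))    # [1. 1. 1.]
print(c2 * np.sqrt(42))   # [-4.  5. -1.]
print(c1 @ c2)            # ≈ 0, so the basis is orthonormal
```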
I will assume we need to rotate these vectors anticlockwise. A rotation about an angle $\theta$ is given by the matrix $\begin{bmatrix} \cos\theta&-\sin\theta\\ \sin\theta&\cos\theta \end{bmatrix}$. Thus for $\theta=30^\circ$, we have $R=\begin{bmatrix} \frac{\sqrt{3}}{2}&-\frac{1}{2}\\ \frac12 & \frac{\sqrt3}{2} \end{bmatrix} = \frac12 \begin{bmatrix} \sqrt3 & -1\\ 1&\sqrt3 \end{bmatrix}$.
Therefore, $Rx_1 = \frac12 \begin{bmatrix} 2\sqrt3-3\\2+3\sqrt3 \end{bmatrix}$ and $Rx_2 = \frac12 \begin{bmatrix} 1\\-\sqrt3 \end{bmatrix}$.
Matrix Decompositions
Perform Gaussian elimination to "fix" the first two columns. I have used only the rule allowing me to add a multiple of a row to a different row, which doesn't change the determinant. We have $\begin{bmatrix} 2&0&1&2&0\\ 0&-1&-1&-1&1\\ 0&0&1&0&3\\ 0&0&3&1&2\\ 0&0&-1&-1&1 \end{bmatrix}$.
From here, we can either continue with Gaussian elimination to get our matrix into upper triangular form, then multiply the entries on the diagonal together (remembering to take into account any elementary operations which would change the determinant!), or we can simply compute the determinant of the lower-right $3\times 3$ matrix, since this is quick to do by hand (it is $-3$). Thus the determinant of the overall matrix is $2\cdot -1\cdot -3 = 6$.
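A quick NumPy check (a sketch): the row operations used above leave the determinant unchanged, so the partially reduced matrix must also have determinant 6.

```python
import numpy as np

M = np.array([[2, 0, 1, 2, 0],
              [0, -1, -1, -1, 1],
              [0, 0, 1, 0, 3],
              [0, 0, 3, 1, 2],
              [0, 0, -1, -1, 1]], dtype=float)

print(np.linalg.det(M))   # ≈ 6.0
```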
If we solve the equation $\det(A-\lambda I) = 0$ for $\lambda$, we obtain $\lambda = 1$ only. (Or, indeed, we can observe that $A$ is in lower triangular form, so the eigenvalues are the entries on the main diagonal.)
The space $E_1 = \{x\in \mathbb{R}^2 : (A-I)(x)=0\} = span\{(0,1)\}$.
If we take $\det(A-\lambda I)$ and simplify, we have $(\lambda-2)(\lambda-1)(\lambda+1)^2$, so we have three eigenvalues. For $\lambda=2$ and $\lambda=1$, our eigenspace will certainly have dimension 1. For $\lambda = -1$, it could (at this stage!) have dimension 1 or 2.
Observe that $(1,0,1,1)$ is an eigenvector with eigenvalue 2, and $(1,1,1,1)$ is an eigenvector with eigenvalue 1. Thus $E_2=span\{(1,0,1,1)\}$ and $E_1 = span\{(1,1,1,1)\}$.
Now observe that $rank(A+I)=3$, so there can only be one linearly independent eigenvector with eigenvalue -1. Note that $(0,1,1,0)$ will do. Hence $E_{-1} = span\{(0,1,1,0)\}$.
Only the first two are diagonalisable – they are diagonal, so there is nothing to prove! The other two, however, are not diagonalisable – each only has one linearly independent eigenvector, whereas we need there to exist a basis of eigenvectors to yield diagonalisability.
Only the first and third matrices are invertible – the determinants are non-zero, while the other two matrices have determinant zero.
This tells us that diagonalisability is indeed unrelated to invertibility!
The characteristic polynomial is $(5-\lambda)(1-\lambda)^2$. Now, $rank(A-I) = 2$, so there is only one linearly independent eigenvector for $\lambda=1$. Hence $A$ is not diagonalisable.
We have $E_5 = span\{(1,1,0)\}$, and $E_1=span\{(-3,1,0)\}$.
The characteristic polynomial here is $-\lambda^3(1-\lambda)$, so the eigenvalues are 1, and 0 (with multiplicity 3). Observe that $rank(A-0I) = rank A = 1$, so there are three linearly independent eigenvectors for the eigenvalue 0. With the other eigenvector for $\lambda=1$, we will have a basis of eigenvectors, and hence $A$ will be diagonalisable.
We have $E_1 = span\{(1,0,0,0)\}$, and $E_0 = span\{(1,-1,0,0),(0,0,1,0),(0,0,0,1)\}$.
First, we compute $A^{\mathsf{T}}A = \begin{bmatrix} 13&12&2\\ 12&13&-2\\ 2&-2&8 \end{bmatrix}$. We can diagonalise this to find $D = \begin{bmatrix} 25&0&0\\ 0&9&0\\ 0&0&0 \end{bmatrix}$ and $P = \begin{bmatrix} \frac{1}{\sqrt2}&\frac{1}{\sqrt{18}}&-\frac23\\ \frac{1}{\sqrt2}&-\frac{1}{\sqrt{18}}&\frac23\\ 0&\frac{4}{\sqrt{18}}&\frac13 \end{bmatrix}$.
We take the square roots of the entries of $D$ to find $\Sigma = \begin{bmatrix} 5&0&0\\ 0&3&0 \end{bmatrix}$, with $V$ equalling our $P$ above.
From here, we compute $u_1 = \frac1{\sigma_1}Av_1 = \frac15\begin{bmatrix}3&2&2\\2&3&-2\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\\0\end{bmatrix} =\begin{bmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\end{bmatrix}$ , and $u_2 = \frac{1}{\sigma_2} A v_2 = \frac13 \begin{bmatrix}3&2&2\\2&3&-2\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt{18}}\\-\frac{1}{\sqrt{18}}\\\frac{4}{\sqrt{18}}\end{bmatrix} =\begin{bmatrix}\frac{1}{\sqrt2}\\-\frac{1}{\sqrt2}\end{bmatrix}$, giving $U=\frac{1}{\sqrt2}\begin{bmatrix}1&1\\1&-1\end{bmatrix}$.
It can be checked that $A = U\Sigma V^{\mathsf{T}}$, indeed!
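Here is that check as a NumPy sketch, which also confirms the singular values:

```python
import numpy as np

A = np.array([[3, 2, 2],
              [2, 3, -2]], dtype=float)
U = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)
Sigma = np.array([[5, 0, 0],
                  [0, 3, 0]], dtype=float)
V = np.array([[1/np.sqrt(2),  1/np.sqrt(18), -2/3],
              [1/np.sqrt(2), -1/np.sqrt(18),  2/3],
              [0,             4/np.sqrt(18),  1/3]])

assert np.allclose(U @ Sigma @ V.T, A)        # A = U Sigma V^T indeed
print(np.linalg.svd(A, compute_uv=False))     # singular values [5. 3.]
```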
Observe that the eigenvalues of $A$ are complex, so we cannot simply find the eigendecomposition. Proceeding as in the previous question, we have $A^{\mathsf{T}}A = \begin{bmatrix}5&3\\3&5\end{bmatrix}$, which when we perform the eigendecomposition on this new matrix, we obtain $D = \begin{bmatrix}8&0\\0&2\end{bmatrix}$ and $P=\frac{1}{\sqrt2}\begin{bmatrix}1&-1\\1&1\end{bmatrix}$. This $P$ is again our required $V$, and we have $\Sigma = \begin{bmatrix}2\sqrt2 & 0\\0&\sqrt2\end{bmatrix}$.
As before, we can now compute
$u_1 = \frac{1}{2\sqrt2}\begin{bmatrix}2&2\\-1&1\end{bmatrix}\begin{pmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\end{pmatrix} = \begin{bmatrix}1\\0\end{bmatrix}$, and similarly $u_2 = \begin{bmatrix}0\\1\end{bmatrix}$. Hence, our $U$ turns out to be the identity matrix.
Using our Singular value decomposition from Question 8, we construct $A_1 = \sigma_1 u_1 v_1^{\mathsf{T}} = 5 \begin{bmatrix}\frac{1}{\sqrt2}\\\frac{1}{\sqrt2}\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt2}&\frac{1}{\sqrt2}&0\end{bmatrix} =\frac52 \begin{bmatrix}1&1&0\\1&1&0\end{bmatrix}$, and similarly $A_2 = \frac12 \begin{bmatrix}1&-1&4\\-1&1&-4\end{bmatrix}$.
Then $A_1$ and $A_2$ both have rank one, with $A=A_1+A_2$, as required.
We need only show this in one direction, since we can replace $A$ with $A^{\mathsf{T}}$ below, and the argument still holds true.
Suppose $\lambda\neq 0$ is an eigenvalue of $A^{\mathsf{T}}A$. Then there is some vector $x\neq 0$ such that $A^{\mathsf{T}}Ax = \lambda x$. Thus, $AA^{\mathsf{T}}Ax = \lambda Ax$. Therefore, $Ax$ is an eigenvector of $AA^{\mathsf{T}}$, with eigenvalue $\lambda$ (note that $Ax \neq 0$, since $A^{\mathsf{T}}Ax = \lambda x \neq 0$). Note also that $x$ itself is in general not an eigenvector of $AA^{\mathsf{T}}$; in fact, it cannot even be multiplied by $AA^{\mathsf{T}}$, because it is $n$-dimensional rather than $m$-dimensional.
Vector Calculus
Firstly, let's rewrite $f(x)$ as $f(x) = 4\log (x)\sin(x^3)$.
Then, $f'(x) = \frac4x\sin(x^3)+12x^2\log(x)\cos(x^3)$.
We compute the first five derivatives of our function at 0. We have $f(0)=f'(0)=1$, $f^{(2)}(0)=f^{(3)}(0)=-1$, and $f^{(4)}(0)=f^{(5)}(0)=1$.
The Taylor polynomial $T_5(x) = 1+x-\frac12 x^2 -\frac16 x^3 + \frac1{24} x^4 + \frac1{120}x^5$. The lower-order Taylor polynomials can be found by truncating this expression appropriately.
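As a sketch, assuming the function in question is $f(x)=\sin x+\cos x$ (its derivatives at $0$ reproduce the pattern $1,1,-1,-1,1,1$), SymPy recovers the same polynomial:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x) + sp.cos(x)            # assumed function; matches the derivatives above
T5 = sp.series(f, x, 0, 6).removeO()
print(sp.expand(T5))                 # agrees with T5(x) above, term by term
```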
We have $\frac{df}{dt} = \cos(\log(t^{\mathsf{T}}t))\cdot\frac1{t^{\mathsf{T}}t}\cdot \begin{bmatrix} 2t_1&2t_2&\cdots&2t_D \end{bmatrix}= \cos(\log(t^{\mathsf{T}}t))\cdot\frac{2t^{\mathsf{T}}}{t^{\mathsf{T}}t}$.
For $g$, if we explicitly compute $AXB$ and find its trace, we have that $g(X) = \sum_{k=1}^D \sum_{j=1}^F \sum_{i=1}^E a_{ki}x_{ij}b_{jk}$. Thus we have, $\frac{\partial g}{\partial x_{ij}} = \sum_{k=1}^D b_{jk}a_{ki}$, and this is the $(i,j)$-th entry of the required derivative.
If we look at the last equation, we can see that the $(i, j)$-th partial derivative is the product of $j$-th row of $B$ and $i$-th column of $A$. Therefore, the right side of the equation is $\left(BA\right)_{ji}$. Since the order of the indices is reversed, i.e., the $(i, j)$-th partial derivative is the $(j, i)$-th element of $BA$, $\frac{dg}{dX} = (BA)^{\mathsf{T}} = A^{\mathsf{T}}B^{\mathsf{T}}$.
Checking the dimensions, we can see that $\left(BA\right)^{\mathsf{T}} \in \mathbb{R}^{E \times F}$, which matches the dimensions of $X$.
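A finite-difference sketch of this result for random matrices (NumPy; the sizes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, F = 4, 3, 5                       # A: D x E, X: E x F, B: F x D
A = rng.normal(size=(D, E))
B = rng.normal(size=(F, D))
X = rng.normal(size=(E, F))

grad_analytic = A.T @ B.T               # claimed derivative of tr(AXB) w.r.t. X
grad_numeric = np.zeros((E, F))
eps = 1e-6
for i in range(E):
    for j in range(F):
        dX = np.zeros((E, F))
        dX[i, j] = eps
        grad_numeric[i, j] = (np.trace(A @ (X + dX) @ B) - np.trace(A @ X @ B)) / eps

print(np.max(np.abs(grad_analytic - grad_numeric)))   # ≈ 0
```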
The chain rule tells us that $\frac{df}{dx} = \frac{df}{dz}\frac{dz}{dx}$, where $\frac{df}{dz}$ has dimension $1\times 1$, and $\frac{dz}{dx}$ has dimension $1\times D$. We know $\frac{dz}{dx}=2x^{\mathsf{T}}$ from $f$ in Question 6. Also, $\frac{df}{dz} = \frac{1}{1+z}$.
Therefore, $\frac{df}{dx} = \frac{2x^{\mathsf{T}}}{1+x^{\mathsf{T}}x}$.
Here we have $\frac{df}{dz}$ is an $E\times E$ matrix, namely $\begin{bmatrix} \cos z_1 & 0 & \cdots & 0\\ 0 & \cos z_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0&\cdots&\cos z_E \end{bmatrix}$.
Also, $\frac{dz}{dx}$ is an $E\times D$-dimensional matrix, namely $A$ itself.
The overall derivative is obtained by multiplying these two matrices together, which will again give us an $E\times D$-dimensional matrix.
We have $\frac{df}{dz}$ has dimension $1\times 1$, and is simply $-\frac12 \exp(-\frac12 z)$.
Now, $\frac{dz}{dy}$ has dimension $1\times D$, and is given by $y^{\mathsf{T}}(S^{-1}+(S^{-1})^{\mathsf{T}})$.
Finally, $\frac{dy}{dx}$ has dimension $D\times D$, and is just the identity matrix.
Again, we multiply these all together to get our final derivative.
If we explicitly write out $xx^{\mathsf{T}}+\sigma^2I$, and compute its trace, we find that $f(x) = x_1^2 + \dots + x_n^2 + n\sigma^2$.
Hence, $\frac{df}{dx} = 2x^{\mathsf{T}}$.
Here, $\frac{df}{dz} = \begin{bmatrix} \frac{1}{\cosh^2z_1}&0&\cdots&0\\ 0&\frac{1}{\cosh^2z_2}&\cdots&0\\ \vdots & \vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{1}{\cosh^2z_M} \end{bmatrix}$, while $\frac{dz}{dx} = A$, as in Question 7b.
Finally, $\frac{df}{dx}$ is given by the product of these two matrices.
Piecing this together, replacing $z$ with $t(\epsilon,\nu)$ throughout, we have that $g(\nu) = \log(p(x,t(\epsilon,\nu))) - \log(q(t(\epsilon,\nu),\nu))$.
Therefore, $\frac{dg}{d\nu} = \frac{p'(x,t(\epsilon,\nu))\cdot t'(\epsilon,\nu) }{p(x,t(\epsilon,\nu))} - \frac{q'(t(\epsilon,\nu),\nu)\cdot t'(\epsilon,\nu)}{q(t(\epsilon,\nu),\nu)} $.
Probability and Distributions
The marginal distributions are obtained by summing the probabilities over all the values of the variable being marginalized. Thus, to obtain $p(x)$ we sum each column of the table (i.e., we sum over the values corresponding to different $y$):
\begin{aligned} p(x_1) &= P(X = x_1) = P(X = x_1, Y = y_1) + P(X = x_1, Y = y_2) + P(X = x_1, Y = y_3) = 0.01 + 0.05 + 0.1 = 0.16 \\ p(x_2) &= P(X = x_2) = P(X = x_2, Y = y_1) + P(X = x_2, Y = y_2) + P(X = x_2, Y = y_3) = 0.02 + 0.1 + 0.05 = 0.17\\ p(x_3) &= P(X = x_3) = P(X = x_3, Y = y_1) + P(X = x_3, Y = y_2) + P(X = x_3, Y = y_3) = 0.03 + 0.05 + 0.03 = 0.11\\ p(x_4) &= P(X = x_4) = P(X = x_4, Y = y_1) + P(X = x_4, Y = y_2) + P(X = x_4, Y = y_3) = 0.1 + 0.07 + 0.05 = 0.22\\ p(x_5) &= P(X = x_5) = P(X = x_5, Y = y_1) + P(X = x_5, Y = y_2) + P(X = x_5, Y = y_3) = 0.1 + 0.2 + 0.04 = 0.34 \end{aligned}As a correctness check, note that this distribution satisfies the normalization condition, i.e. that sum of the probabilities is $1$:
\begin{equation} \sum_{i=1}^5 p(x_i) = 1 \end{equation}The marginal distribution $p(y)$ can be obtained in a similar way, by summing the matrix rows:
\begin{aligned} p(y_1) &= P(Y = y_1) = \sum_{i=1}^5 P(X = x_i, Y = y_1) = 0.01 + 0.02 + 0.03 + 0.1 + 0.1 = 0.26 \\ p(y_2) &= P(Y = y_2) = \sum_{i=1}^5 P(X = x_i, Y = y_2) = 0.05 + 0.1 + 0.05 + 0.07 + 0.2 = 0.47 \\ p(y_3) &= P(Y = y_3) = \sum_{i=1}^5 P(X = x_i, Y = y_3) = 0.1 + 0.05 + 0.03 + 0.05 + 0.04 = 0.27 \end{aligned}We can again check that the normalization condition is satisfied:
\begin{equation} \sum_{i=1}^3p(y_i) = 1 \end{equation}To determine conditional distributions we use the definition of the conditional probability:
\[P(X = x , Y = y_1) = P(X = x | Y = y_1)P(Y = y_1) = p(x | Y = y_1) p(y_1).\]
Thus,
\begin{align*} p(x_1 | Y = y_1) &= \frac{P(X = x_1, Y = y_1)}{p(y_1)} = \frac{0.01}{0.26} \approx 0.038\\ p(x_2 | Y = y_1) &= \frac{P(X = x_2, Y = y_1)}{p(y_1)} = \frac{0.02}{0.26} \approx 0.077\\ p(x_3 | Y = y_1) &= \frac{P(X = x_3, Y = y_1)}{p(y_1)} = \frac{0.03}{0.26} \approx 0.115\\ p(x_4 | Y = y_1) &= \frac{P(X = x_4, Y = y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385\\ p(x_5 | Y = y_1) &= \frac{P(X = x_5, Y = y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385 \end{align*}Likewise the conditional distribution $p(y | X = x_3)$ is given by
\begin{align*} p(y_1 | X = x_3) &= \frac{P(X = x_3, Y = y_1)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273\\ p(y_2 | X = x_3) &= \frac{P(X = x_3, Y = y_2)}{p(x_3)} = \frac{0.05}{0.11} \approx 0.455\\ p(y_3 | X = x_3) &= \frac{P(X = x_3, Y = y_3)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273 \end{align*}
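The same marginals and conditionals can be reproduced with a few lines of NumPy (a sketch; the table is stored with $y$ as rows and $x$ as columns):

```python
import numpy as np

# joint table P(X = x_j, Y = y_i): rows y1..y3, columns x1..x5
P = np.array([[0.01, 0.02, 0.03, 0.10, 0.10],
              [0.05, 0.10, 0.05, 0.07, 0.20],
              [0.10, 0.05, 0.03, 0.05, 0.04]])

p_x = P.sum(axis=0)        # [0.16 0.17 0.11 0.22 0.34]
p_y = P.sum(axis=1)        # [0.26 0.47 0.27]
print(p_x, p_x.sum())      # marginal p(x), sums to 1
print(p_y, p_y.sum())      # marginal p(y), sums to 1

print(P[0] / p_y[0])       # p(x | Y = y1)
print(P[:, 2] / p_x[2])    # p(y | X = x3)
```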
We can write the probability density of the two-dimensional distribution as
\[p(x,y)= 0.4\mathcal{N}\left(x, y|\begin{bmatrix} 10\\ 2\end{bmatrix}, \begin{bmatrix} 1&0\\0&1\end{bmatrix}\right)+ 0.6\mathcal{N}\left(x, y|\begin{bmatrix} 0\\ 0\end{bmatrix}, \begin{bmatrix} 8.4&2.0\\2.0&1.7\end{bmatrix}\right)\]
The marginal distribution of a weighted sum of distributions is given by the weighted sum of the marginals, and the marginals of a bivariate normal distribution $\mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})$ are obtained according to the rule
\[\int \mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})dy= \mathcal{N}(x|\mu_x, \Sigma_{xx}), \\ \int \mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})dx = \mathcal{N}(y|\mu_y, \Sigma_{yy}) \]
Thus, the marginals of the distribution of interest are
\begin{gather*} p(x) = 0.4\mathcal{N}(x| 10, 1) + 0.6\mathcal{N}(x| 0, 8.4),\\ p(y) = 0.4\mathcal{N}(y| 2, 1) + 0.6\mathcal{N}(y| 0, 1.7) \end{gather*}The mean of a weighted sum of two distributions is the weighted sum of their means
\begin{gather*} \mathbb{E}_X[x] = 0.4\cdot 10 + 0.6\cdot 0 = 4,\\ \mathbb{E}_Y[y] = 0.4\cdot 2 + 0.6\cdot 0 = 0.8 \end{gather*}The mode of a continuous distribution is a point where this distribution has a peak. It can be determined by solving the extremum condition for each of the marginal distributions:
\begin{gather*} \frac{dp(x)}{dx} = 0,\\ \frac{dp(y)}{dy} = 0 \end{gather*}In the case of a mixture of normal distributions these equations are non-linear and can be solved only numerically. After finding all the solutions of these equations one has to verify for every solution that it is a peak rather than an inflection point, i.e. that at this point
\[\frac{d^2p(x)}{dx^2} < 0 \text{ for } p(x), \qquad \frac{d^2p(y)}{dy^2} < 0 \text{ for } p(y)\]
The medians $m_x, m_y$ can be determined from the conditions
\begin{gather*} \int_{-\infty}^{m}p(x)dx = \int^{+\infty}_{m}p(x)dx,\\ \int_{-\infty}^{m}p(y)dy = \int^{+\infty}_{m}p(y)dy \end{gather*}Again, these equations can be solved here only numerically.
The mean of a two-dimensional distribution is a vector of means of the marginal distributions
\[\mathbf{\mu} = \begin{bmatrix}4\\0.8\end{bmatrix}\]
The mode of a two-dimensional distribution is obtained by first solving the extremum conditions
\[\frac{\partial p(x,y)}{\partial x} = 0, \frac{\partial p(x,y)}{\partial y} = 0\]
and then verifying for every solution that it is indeed a peak, i.e.
\begin{gather*} \frac{\partial^2 p(x,y)}{\partial x^2} < 0, \frac{\partial^2 p(x,y)}{\partial y^2} < 0,\\ \det\left( \begin{bmatrix} \frac{\partial^2 p(x,y)}{\partial x^2} & \frac{\partial^2 p(x,y)}{\partial x\partial y}\\ \frac{\partial^2 p(x,y)}{\partial x\partial y} & \frac{\partial^2 p(x,y)}{\partial y^2} \end{bmatrix} \right) > 0 \end{gather*}Again, these equations can be solved only numerically.
The conjugate prior to the Bernoulli distribution is the Beta distribution
\[p(\mu | \alpha, \beta) =\frac{1}{\mathcal{B}(\alpha, \beta)} \mu^{\alpha -1}(1-\mu)^{\beta-1} \propto \mu^{\alpha -1}(1-\mu)^{\beta-1},\]
where $\alpha,\beta$ are not necessarily integers and the normalisation coefficient is the Beta function defined as
\[\mathcal{B}(\alpha, \beta) = \int_0^1 t^{\alpha -1}(1-t)^{\beta-1}dt\]
The likelihood of observing data $\{x_1, x_2, ..., x_N\}$ is
\[p(x_1, ..., x_N|\mu) = \prod_{i=1}^Np(x_i|\mu) = \prod_{i=1}^N \mu^{x_i}(1-\mu)^{1-x_i} = \mu^{\sum_{i=1}^N x_i}(1-\mu)^{N-\sum_{i=1}^N x_i}\]
The posterior distribution is proportional to the product of this likelihood with the prior distribution (Bayes theorem):
\[p(\mu |x_1, ..., x_N) \propto p(x_1, ..., x_N|\mu)p(\mu | \alpha, \beta) \propto \mu^{\sum_{i=1}^N x_i + \alpha -1}(1-\mu)^{N-\sum_{i=1}^N x_i +\beta -1}\]
This is also a Beta distribution, i.e. our choice of the conjugate prior was correct. The normalization constant is readily determined:
\[p(\mu |x_1, ..., x_N) = \frac{1}{\mathcal{B}(\sum_{i=1}^N x_i + \alpha, N-\sum_{i=1}^N x_i +\beta)} \mu^{\sum_{i=1}^N x_i + \alpha -1}(1-\mu)^{N-\sum_{i=1}^N x_i +\beta -1}\]
The probabilities of picking a mango or an apple from the first bag are given by
\begin{gather*} p(\text{mango} |1) = \frac{4}{6} = \frac{2}{3}\\ p(\text{apple} |1) = \frac{2}{6} = \frac{1}{3} \end{gather*}The probabilities of picking a mango or an apple from the second bag are
\begin{gather*} p(\text{mango} |2) = \frac{4}{8} = \frac{1}{2}\\ p(\text{apple} |2) = \frac{4}{8} = \frac{1}{2} \end{gather*}The probabilities of picking the first or the second bag are equal to the probabilities of heads and tails, respectively:
\begin{gather*} p(1) = 0.6,\\ p(2) = 0.4 \end{gather*}We can now obtain the probability that the mango was picked from the second bag using Bayes' theorem:
\[ p(2 | \text{mango}) = \frac{p(\text{mango} | 2)p(2)}{p(\text{mango})} = \frac{p(\text{mango} | 2)p(2)}{p(\text{mango} | 1)p(1) + p(\text{mango} | 2)p(2)} = \frac{\frac{1}{2}0.4}{\frac{2}{3}0.6 + \frac{1}{2}0.4} = \frac{1}{3} \]
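The same Bayes computation as a tiny Python sketch:

```python
p_mango_given_1, p_mango_given_2 = 2/3, 1/2   # likelihoods of drawing a mango
p_1, p_2 = 0.6, 0.4                           # prior probabilities of the bags

p_mango = p_mango_given_1 * p_1 + p_mango_given_2 * p_2
print(p_mango_given_2 * p_2 / p_mango)        # 0.333... = 1/3
```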
Let $\mathbf{z} = \mathbf{A}\mathbf{x}_{t}$. Since $\mathbf{x}_t \sim \mathcal{N}(\mathbf{\mu}_t,\mathbf{\Sigma})$ and $\mathbf{z}$ is a linear transformation of this Gaussian random variable, $\mathbf{z}$ is distributed as (see Eq. (6.88))
\[\mathbf{z} \sim \mathcal{N}(\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T),\]
and the mean and covariance of a sum of two independent Gaussian random variables are given by the sums of the means and the covariances of these variables, i.e.,
\[\mathbf{x}_{t+1} = \mathbf{z} + \mathbf{w} \sim \mathcal{N}(\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}),\]
That is
\[p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}). \]
If we assume that $\mathbf{x}_{t+1}$ is fixed, then $\mathbf{y}_{t+1} = \mathbf{C}\mathbf{x}_{t+1} + \mathbf{v}$ follows the same distribution as $\mathbf{v}$, but with the mean shifted by $\mathbf{C}\mathbf{x}_{t+1}$, i.e.
\[ p(\mathbf{y}_{t+1}|\mathbf{x}_{t+1}, \mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}). \]
The joint probability is then obtained as
\[ p(\mathbf{y}_{t+1}, \mathbf{x}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)= p(\mathbf{y}_{t+1}|\mathbf{x}_{t+1}, \mathbf{y}_1,...,\mathbf{y}_t) p(\mathbf{x}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}) \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}). \]
Let us introduce temporary notation
\[ \mathbf{\mu}_{t+1} = \mathbf{A}\mathbf{\mu}_t,\\ \mathbf{\Sigma}_{t+1} = \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q},\\ p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t) = \mathcal{N}(\mathbf{\mu}_{t+1}, \mathbf{\Sigma}_{t+1}) \]
The distribution of $\mathbf{y}_{t+1}$ is then obtained in terms of the parameters of $p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t)$, following the same steps as in question 1), with the result
\[ p(\mathbf{y}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{\mu}_{t+1}, \mathbf{C}\mathbf{\Sigma}_{t+1}\mathbf{C}^T + \mathbf{R})= \mathcal{N}\left(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{A}\mathbf{\mu}_t, \mathbf{C}(\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T+ \mathbf{Q})\mathbf{C}^T + \mathbf{R}\right). \]
The required conditional distribution is then obtained as
\[ p(\mathbf{x}_{t+1}|\mathbf{y}_1,...,\mathbf{y}_t, \mathbf{y}_{t+1})= \frac{p(\mathbf{y}_{t+1}, \mathbf{x}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)} {p(\mathbf{y}_{t+1}| \mathbf{y}_1,...,\mathbf{y}_t)}= \frac{\mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}) \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q})} {\mathcal{N}\left(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{A}\mathbf{\mu}_t, \mathbf{C}(\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q})\mathbf{C}^T + \mathbf{R}\right)} \]
The standard definition of variance is
\[ \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2], \]
where
\[ \mu = \mathbb{E}_X[x]. \]
Using the linearity of the expectation we can write:
\begin{gather*} \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] = \mathbb{E}_X[x^2 - 2x\mu +\mu^2] = \mathbb{E}_X[x^2] - \mathbb{E}_X[2x\mu] + \mathbb{E}_X[\mu^2]=\\ \mathbb{E}_X[x^2] - 2\mu\mathbb{E}_X[x] + \mu^2 = \mathbb{E}_X[x^2] - 2\mu^2 + \mu^2 = \mathbb{E}_X[x^2] - \mu^2 \end{gather*}Substituting the definition of $\mu$ into this equation, we obtain the desired equation
\[ \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 \]
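A short numerical sketch of the identity on a random sample (NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

lhs = np.mean((x - x.mean())**2)            # E[(x - mu)^2]
rhs = np.mean(x**2) - x.mean()**2           # E[x^2] - (E[x])^2
print(lhs, rhs)                             # identical up to rounding
```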
Let us expand the square in the left-hand side of (6.45)
\[ \frac{1}{N^2}\sum_{i,j=1}^N(x_i - x_j)^2 = \frac{1}{N^2}\sum_{i,j=1}^N(x_i^2 - 2x_i x_j + x_j^2) = \frac{1}{N^2}\sum_{i,j=1}^N x_i^2 - 2\frac{1}{N^2}\sum_{i,j=1}^N x_i x_j + \frac{1}{N^2}\sum_{i,j=1}^Nx_j^2 \]
We see that the first and the last term differ only by the summation index, i.e. they are identical: \[ \frac{1}{N^2}\sum_{i,j=1}^N x_i^2 + \frac{1}{N^2}\sum_{i,j=1}^Nx_j^2= 2\frac{1}{N^2}\sum_{i,j=1}^N x_i^2 = 2\frac{1}{N}\sum_{i=1}^N x_i^2, \]
since the summation over $j$ gives a factor of $N$.
The remaining term can be written as
\[ 2\frac{1}{N^2}\sum_{i,j=1}^N x_i x_j = 2\frac{1}{N^2}\sum_{i=1}^N x_i \sum_{j=1}^N x_j = 2\left(\frac{1}{N}\sum_{i=1}^N x_i\right)^2, \]
where we used the fact that the double sum factorises into the product of two single sums, which are equal since the name of the summation index is immaterial.
We have thus proved the required relation
\[ \frac{1}{N^2}\sum_{i,j=1}^N(x_i - x_j)^2 = 2\frac{1}{N}\sum_{i=1}^N x_i^2 - 2\left(\frac{1}{N}\sum_{i=1}^N x_i\right)^2 \]
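A short NumPy sketch checking the relation on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
N = len(x)

lhs = np.sum((x[:, None] - x[None, :])**2) / N**2
rhs = 2 * np.mean(x**2) - 2 * np.mean(x)**2
print(lhs, rhs)                             # equal up to rounding
```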
The Bernoulli distribution is given by
\[ p(x|\mu) = \mu^x (1-\mu)^{1-x} \]
We can use the relation
\[ a^x = e^{x\log a} \]
to write the Bernoulli distribution as
\[ p(x|\mu) = e^{x\log\mu + (1-x)\log(1-\mu)}= e^{x\log\left(\frac{\mu}{1-\mu}\right) + \log(1-\mu)} = h(x)e^{\theta x - A(\theta)}, \]
where the last equation is the definition of a single-parameter distribution from the exponential family, in which
\[ h(x) = 1,\\ \theta = \log\left(\frac{\mu}{1-\mu}\right) \leftrightarrow \mu = \frac{e^\theta}{1+e^\theta},\\ A(\theta) = -\log(1-\mu) = \log(1+e^\theta) \]
The binomial distribution can be transformed as
\[ p(x|N,\mu) = {N\choose x} \mu^x (1-\mu)^{N-x} = {N \choose x} e^{x\log\mu + (N-x)\log(1-\mu)}= {N \choose x}e^{x\log\left(\frac{\mu}{1-\mu}\right) +N\log(1-\mu)} = h(x)e^{x\theta - A(\theta)} \]
where
\[ h(x) = {N \choose x},\\ \theta = \log\left(\frac{\mu}{1-\mu}\right),\\ A(\theta) = -N\log(1-\mu) = N\log(1+e^\theta) \]
i.e., the binomial distribution can be represented as an exponential family distribution (only $\mu$ is treated as a parameter here, since the number of trials $N$ is fixed).
Similarly, the beta distribution can be transformed as
\[ p(x |\alpha, \beta) = \frac{1}{\mathcal{B}(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1} = e^{(\alpha -1)\log x + (\beta -1)\log(1-x) - \log(\mathcal{B}(\alpha,\beta))}= h(x)e^{\theta_1\phi_1(x) + \theta_2\phi_2(x) -A(\theta_1, \theta_2)} \]
where
\begin{gather*} h(x) = 1,\\ \theta_1 = \alpha-1, \theta_2 = \beta-1,\\ \phi_1(x) = \log x, \phi_2(x) = \log(1-x),\\ A(\theta_1, \theta_2) = \log(\mathcal{B}(\alpha,\beta)) = \log(\mathcal{B}(1+\theta_1,1 + \theta_2)) \end{gather*}i.e. this is a distribution from the exponential family.
The product of the two distributions is then given by
\begin{gather*} p(x|N,\mu) p(x|\alpha, \beta)={N \choose x}e^{x\log\left(\frac{\mu}{1-\mu}\right) + (\alpha-1)\log x + (\beta -1)\log(1-x) + N\log(1-\mu) - \log(\mathcal{B}(\alpha,\beta))}\\ = h(x) e^{\theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \theta_3\phi_3(x) - A(\theta_1, \theta_2, \theta_3)} \end{gather*}where
\begin{gather*} h(x) = {N \choose x},\\ \theta_1 = \alpha-1, \theta_2 = \beta-1,\theta_3 = \log\left(\frac{\mu}{1-\mu}\right)\\ \phi_1(x) = \log x, \phi_2(x) = \log(1-x), \phi_3(x) = x\\ A(\theta_1, \theta_2, \theta_3) = \log(\mathcal{B}(\alpha,\beta)) -N\log(1-\mu) = \log(\mathcal{B}(1+\theta_1,1 + \theta_2)) + N\log(1+e^{\theta_3}) \end{gather*}
The two normal distributions are given by
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) = (2\pi)^{-\frac{D}{2}}|\mathbf{A}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})\right],\\ \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-\frac{D}{2}}|\mathbf{B}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})\right] \end{gather*}their product is
\[ \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-D}|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left[(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})+(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})\right]\right\} \]
The expression in the exponent can be written as
\begin{gather*} \Phi = (\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})+(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})=\\ \mathbf{x}^T\mathbf{A}^{-1}\mathbf{x} - \mathbf{a}^T\mathbf{A}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{A}^{-1}\mathbf{a}+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a}+ \mathbf{x}^T\mathbf{B}^{-1}\mathbf{x} - \mathbf{b}^T\mathbf{B}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{B}^{-1}\mathbf{b}+ \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}=\\ \mathbf{x}^T(\mathbf{A}^{-1}+\mathbf{B}^{-1})\mathbf{x}- (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})\mathbf{x}- \mathbf{x}^T(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} \end{gather*}we now introduce notation
\begin{gather*} \mathbf{C}^{-1} = (\mathbf{A}^{-1}+\mathbf{B}^{-1}),\\ \mathbf{c} = \mathbf{C}(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b}),\\ \mathbf{c}^T = (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})\mathbf{C}\text{ (transpose the previous equation and use the symmetry of }\mathbf{C}\text{)} \end{gather*}The expression in the exponent now takes the form
\begin{gather*} \Phi= \mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{C}^{-1}\mathbf{c}+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}=\\ \mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{C}^{-1}\mathbf{c}+ \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} + \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}- \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c}=\\ (\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \end{gather*}where we have completed the square.
The product of the two probability distributions can be now written as
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-D}|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left[(\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\}=\\ (2\pi)^{-\frac{D}{2}}|\mathbf{C}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})\right] \times (2\pi)^{-\frac{D}{2}}\frac{|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}}}{|\mathbf{C}|^{-\frac{1}{2}}} \exp\left\{-\frac{1}{2}\left[ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\}=\\ c\,\mathcal{N}(\mathbf{x}|\mathbf{c}, \mathbf{C}), \end{gather*}where we defined
\[ c = (2\pi)^{-\frac{D}{2}}\frac{|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}}}{|\mathbf{C}|^{-\frac{1}{2}}} \exp\left\{-\frac{1}{2}\left[ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\} \]
We can now use the properties that (a) the determinant of a matrix product is the product of the determinants, and (b) the determinant of a matrix inverse is the inverse of the determinant of this matrix, and write
\[ \frac{|\mathbf{A}||\mathbf{B}|}{|\mathbf{C}|}= |\mathbf{A}||\mathbf{C}^{-1}||\mathbf{B}|= |\mathbf{A}\mathbf{C}^{-1}\mathbf{B}|= |\mathbf{A}(\mathbf{A}^{-1} + \mathbf{B}^{-1})\mathbf{B}|= |\mathbf{A} + \mathbf{B}| \]
For the expression in the exponent we can write
\begin{gather*} \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c}= \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}- (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} (\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})\\ =\mathbf{a}^T\left[\mathbf{A}^{-1} - \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}\right]\mathbf{a}+ \mathbf{b}^T\left[\mathbf{B}^{-1} - \mathbf{B}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{B}^{-1}\right]\mathbf{b}- \mathbf{a}^T\mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{B}^{-1}\mathbf{b}- \mathbf{b}^T\mathbf{B}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{A}^{-1}\mathbf{a} \end{gather*}Using the property $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ we obtain
\[ \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{B}^{-1}= \left[\mathbf{B}(\mathbf{A}^{-1} + \mathbf{B}^{-1})\mathbf{A}\right]^{-1}= (\mathbf{A} + \mathbf{B})^{-1} \]
and
\begin{gather*} \mathbf{A}^{-1} - \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}=\\ \mathbf{A}^{-1}\left[\mathbf{I} - (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}\right]=\\ \mathbf{A}^{-1}\left[\mathbf{I} - \mathbf{B}(\mathbf{A} + \mathbf{B})^{-1}\mathbf{A}\mathbf{A}^{-1}\right]=\\ \mathbf{A}^{-1}\left[\mathbf{I} - \mathbf{B}(\mathbf{A} + \mathbf{B})^{-1}\right]=\\ \mathbf{A}^{-1}\left[(\mathbf{A} + \mathbf{B}) - \mathbf{B}\right](\mathbf{A} + \mathbf{B})^{-1}=\\ (\mathbf{A} + \mathbf{B})^{-1} \end{gather*}We thus conclude that
\begin{gather*} c = (2\pi)^{-\frac{D}{2}}|\mathbf{A}+\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}( \mathbf{a} - \mathbf{b})^T(\mathbf{A} + \mathbf{B})^{-1}(\mathbf{a} -\mathbf{b})\right\}=\\ \mathcal{N}(\mathbf{b}|\mathbf{a}, \mathbf{A}+ \mathbf{B})=\\ \mathcal{N}(\mathbf{a}|\mathbf{b}, \mathbf{A}+ \mathbf{B}) \end{gather*}The multivariate normal distribution $\mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})$ can be represented as a distribution from the exponential family:
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})= (2\pi)^{-\frac{D}{2}}|\mathbf{A}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})\right]=\\ (2\pi)^{-\frac{D}{2}} \exp\left[-\frac{1}{2}\text{tr}(\mathbf{A}^{-1}\mathbf{x}\mathbf{x}^T) + \mathbf{a}^T\mathbf{A}^{-1}\mathbf{x}- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}| \right], \end{gather*}where we used that $\mathbf{a}^T\mathbf{A}^{-1}\mathbf{x} = \mathbf{x}^T\mathbf{A}^{-1}\mathbf{a}$, and also write the first term as
\[ \mathbf{x}^T\mathbf{A}^{-1}\mathbf{x}= \sum_{i,j}x_i (\mathbf{A}^{-1})_{ij} x_j= \sum_{i,j}(\mathbf{A}^{-1})_{ij} x_j x_i= \sum_{i,j}(\mathbf{A}^{-1})_{ij} (\mathbf{x}\mathbf{x}^T)_{ji}= \text{tr}(\mathbf{A}^{-1}\mathbf{x}\mathbf{x}^T) \]
Representing $\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B})$ in a similar way and multiplying the two distributions we readily obtain
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B})=\\ (2\pi)^{-D} \exp\left\{-\frac{1}{2}\text{tr}\left[(\mathbf{A}^{-1}+ \mathbf{B}^{-1})\mathbf{x}\mathbf{x}^T\right]+ (\mathbf{a}^T\mathbf{A}^{-1}+\mathbf{b}^T\mathbf{B}^{-1})\mathbf{x}- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}|- \frac{1}{2}\mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \frac{1}{2}\log|\mathbf{B}| \right\}=\\ c\mathcal{N}(\mathbf{x}|\mathbf{c},\mathbf{C}), \end{gather*}where we defined
\begin{gather*} \mathbf{C}^{-1} = \mathbf{A}^{-1}+ \mathbf{B}^{-1},\\ \mathbf{c}^T\mathbf{C}^{-1} = \mathbf{a}^T\mathbf{A}^{-1}+\mathbf{b}^T\mathbf{B}^{-1},\\ c = (2\pi)^{-\frac{D}{2}} \exp\left\{\frac{1}{2}\mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} + \frac{1}{2}\log|\mathbf{C}|- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}|- \frac{1}{2}\mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \frac{1}{2}\log|\mathbf{B}| \right\} \end{gather*}Coefficient $c$ can now be reduced to the required form using the matrix transformations described in part a).
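Before moving on, here is a numerical sketch (NumPy/SciPy) of the identity derived above, $\mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B}) = c\,\mathcal{N}(\mathbf{x}|\mathbf{c},\mathbf{C})$ with $c=\mathcal{N}(\mathbf{a}|\mathbf{b},\mathbf{A}+\mathbf{B})$, checked at a few random points:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
D = 3

def random_spd(d):
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)      # a symmetric positive definite covariance

a, b = rng.normal(size=D), rng.normal(size=D)
A, B = random_spd(D), random_spd(D)

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c_vec = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)
c = mvn.pdf(a, mean=b, cov=A + B)       # the claimed scaling constant

for x in rng.normal(size=(5, D)):
    lhs = mvn.pdf(x, mean=a, cov=A) * mvn.pdf(x, mean=b, cov=B)
    rhs = c * mvn.pdf(x, mean=c_vec, cov=C)
    assert np.isclose(lhs, rhs)
print("product-of-Gaussians identity verified at sampled points")
```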
The expectation value and the conditional expectation value are given by
\[ \mathbb{E}_X[x] = \int x p(x) dx,\\ \mathbb{E}_Y[f(y)] = \int f(y) p(y) dy,\\ \mathbb{E}_X[x|y] = \int x p(x|y) dx \]
We then have
\begin{gather*} \mathbb{E}_Y\left[\mathbb{E}_X[x|y]\right] = \int \mathbb{E}_X[x|y] p(y) dy\\ = \int \left[\int xp(x|y)dx\right]p(y) dy\\ = \int \int xp(x|y)p(y)dx dy\\ = \int\int xp(x,y)dxdy\\ = \int x\left[\int p(x,y) dy\right] dx\\ = \int x p(x) dx\\ = \mathbb{E}_X[x], \end{gather*}where we used the definition of the conditional probability density
\[ p(x|y)p(y) = p(x,y) \]
If $\mathbf{x}$ is fixed, then $\mathbf{y}$ has the same distribution as $\mathbf{w}$, but with the mean shifted by $\mathbf{A}\mathbf{x} + \mathbf{b}$, that is
\[ p(\mathbf{y}|\mathbf{x}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{Q}) \]
Let us consider the random variable $\mathbf{u} = \mathbf{A}\mathbf{x}$; it is distributed according to
\[ p(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\mathbf{A}\mathbf{\mu}_x, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T). \]
Then $\mathbf{y}$ is a sum of the two independent Gaussian random variables $\mathbf{u}$ and $\mathbf{w}$, with its mean additionally shifted by $\mathbf{b}$, that is
\[ p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{\mu}_x + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q}), \]
that is
\[ \mathbf{\mu}_y = \mathbf{A}\mathbf{\mu}_x + \mathbf{b},\\ \mathbf{\Sigma}_y = \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q}. \]
As in b), assuming that $\mathbf{y}$ is fixed, we obtain the conditional distribution
\[ p(\mathbf{z}|\mathbf{y}) = \mathcal{N}(\mathbf{z}|\mathbf{C}\mathbf{y}, \mathbf{R}) \]
Since $\mathbf{C}\mathbf{y}$ is a Gaussian random variable with distribution $\mathcal{N}(\mathbf{C}\mathbf{\mu}_y, \mathbf{C}\mathbf{\Sigma}_y\mathbf{C}^T)$, we obtain the distribution of $\mathbf{z}$ as that of a sum of two Gaussian random variables:
\[ p(\mathbf{z})= \mathcal{N}(\mathbf{z} |\mathbf{C}\mathbf{\mu}_y, \mathbf{C}\mathbf{\Sigma}_y\mathbf{C}^T + \mathbf{R})= \mathcal{N}(\mathbf{z} |\mathbf{C}(\mathbf{A}\mathbf{\mu}_x + \mathbf{b}), \mathbf{C}(\mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q})\mathbf{C}^T + \mathbf{R}) \]
The posterior distribution $p(\mathbf{x}|\mathbf{y})$ can be obtained by applying Bayes' theorem:
\[ p(\mathbf{x}|\mathbf{y})= \frac{p(\mathbf{y}|\mathbf{x})p(\mathbf{x})}{p(\mathbf{y})}= \frac{\mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{Q})\mathcal{N}(\mathbf{x}|\mathbf{\mu}_x,\mathbf{\Sigma}_x)} {\mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{\mu}_x + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q})} \]
The cdf is related to the pdf by
\[ F_x(x) = \int_{-\infty}^xdx' f_x(x'),\\ \frac{d}{dx} F_x(x) = f_x(x) \]
and takes values in the interval $[0,1]$.
The pdf of the variable $y=F_x(x)$ can then be obtained from the change-of-variables formula:
\[ f_y(y) = f_x(x) \left|\frac{dx}{dy}\right| = \frac{f_x(x)}{\left|\frac{dy}{dx}\right|} = \frac{f_x(x)}{\left|\frac{dF_x(x)}{dx}\right|} = \frac{f_x(x)}{f_x(x)} = 1, \]
i.e. $y$ is uniformly distributed on the interval $[0,1]$.
Continuous Optimization
The first and second derivatives of the function $f(x)$ are given by
\[ f'(x) = 3x^2 + 12 x -3,\\ f''(x) = 6x + 12 \]
To find the candidate maxima and minima we find the points where the first derivative vanishes:
\[ 3x^2 + 12 x -3 = 0 \Rightarrow x_{1,2} = -2 \pm \sqrt{5} \]
Substituting each of the roots into the second derivative we obtain that
\[ f''(x_1) = f''(-2 + \sqrt{5}) > 0,\\ f''(x_2) = f''(-2 - \sqrt{5}) < 0, \]
i.e. $x_1 = -2 + \sqrt{5}$ is a local minimum, whereas $x_2 = -2 - \sqrt{5}$ is a local maximum of the function $f(x)$.
The full batch update rule for gradient descent is
\[ \mathbf{\theta}_{i+1} = \mathbf{\theta}_i - \gamma_i\sum_{n=1}^{N}\left(\nabla L_n(\mathbf{\theta}_i)\right)^T \]
For a mini-batch of size $1$, at every step we use only one data point, i.e. we randomly choose an index $n$ from $\{1,\dots,N\}$ and calculate the update as
\[ \mathbf{\theta}_{i+1} = \mathbf{\theta}_i - \gamma_i\left(\nabla L_n(\mathbf{\theta}_i)\right)^T \]
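A minimal sketch of the two update rules, using a hypothetical least-squares loss $L_n(\mathbf{\theta}) = \frac12(\mathbf{x}_n^{\mathsf{T}}\mathbf{\theta} - y_n)^2$ (the data, loss, and step size are invented for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=N)

def grad_Ln(theta, n):
    # gradient of the single-example loss L_n(theta) = 0.5 * (x_n^T theta - y_n)^2
    return (X[n] @ theta - y[n]) * X[n]

gamma = 0.005
theta_batch = np.zeros(D)
theta_sgd = np.zeros(D)
for i in range(2000):
    # full-batch update: sum the gradients over all N data points
    theta_batch -= gamma * sum(grad_Ln(theta_batch, n) for n in range(N))
    # mini-batch of size 1: one randomly chosen data point per step
    n = rng.integers(N)
    theta_sgd -= gamma * grad_Ln(theta_sgd, n)

print(theta_batch)   # ≈ theta_true
print(theta_sgd)     # ≈ theta_true, up to stochastic-gradient noise
```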
True. Indeed, the line segment connecting any two points in the intersection of the two convex sets belongs to each of the sets (since they are convex), and therefore also belongs to the intersection.
False. For example, if the two sets are disjoint, their union need not contain the line segment joining a point of one set to a point of the other.
False. Subtracting one set from the other may remove points that lie on line segments connecting points of the remaining set.
True. If $f(\mathbf{x})$ and $g(\mathbf{x})$ are convex functions, it means that for any $\theta \in[0,1]$:
\[ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y}),\\ g(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta g(\mathbf{x}) + (1-\theta)g(\mathbf{y}). \]
By adding these inequalities we immediately obtain the condition of convexity for their sum, $h(\mathbf{x}) = f(\mathbf{x}) + g(\mathbf{x})$:
\begin{gather*} h(\theta \mathbf{x} + (1-\theta)\mathbf{y})=\\ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) + g(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y}) + \theta g(\mathbf{x}) + (1-\theta)g(\mathbf{y})=\\ \theta h(\mathbf{x}) + (1-\theta)h(\mathbf{y}) \end{gather*}False. For example, the difference of $f(x) = ax^2$ and $g(x) = bx^2$ (with $a, b > 0$) is convex if $a \geq b$, but concave otherwise.
False. Consider the product of the two convex functions $f(x) = (x-a)^2$ and $g(x) = (x+a)^2$: the function $h(x) = (x-a)^2(x+a)^2$ has two minima and a local maximum at $x=0$ (for $a \neq 0$). Indeed, if we take $x=a$, $y=-a$ and $\theta = 0.5$, then
\[ h(\theta x + (1-\theta)y) = h(0) = a^4 > 0 = 0.5 h(a) + 0.5 h(-a) = \theta h(x) + (1-\theta)h(y), \]
i.e. the condition of convexity is not satisfied.
The maximum of a function is a value rather than a function, so the notion of convexity of functions does not apply to it. If, however, we regard the maximum as a set consisting of a single point, then it is trivially convex.
We first introduce vectors $\mathbf{y} = (x_0, x_1, \xi)^T$ and $\mathbf{c} = (p_0, p_1, 1)^T$, so that $\mathbf{p}^T\mathbf{x} + \xi = \mathbf{c}^T\mathbf{y}$.
We then introduce the vector $\mathbf{b} = (0, 3, 0)^T$ and the matrix
\[ \mathbf{A} =\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & -1 \end{bmatrix}\]
which allow us to write the three constraints as $\mathbf{A}\mathbf{y} \leq \mathbf{b}$.
We now can write the problem as a standard linear program:
\begin{gather*} \max_{\mathbf{y} \in \mathbb{R}^3} \mathbf{c}^T\mathbf{y}\\ \text{subject to } \mathbf{A}\mathbf{y} \leq \mathbf{b} \end{gather*}
Let us define $\mathbf{c} = (-5, -3)^T$, $\mathbf{b} = (33, 8, 5, -1, 8)^T$ and
\[ \mathbf{A}=\begin{bmatrix} 2 & 2\\ 2 & -4\\-2 & 1\\ 0 & -1\\ 0 & 1 \end{bmatrix}\]
The linear program is then written as
\begin{gather*} \min_{\mathbf{x} \in \mathbb{R}^2} \mathbf{c}^T\mathbf{x}\\ \text{subject to } \mathbf{A}\mathbf{x} \leq \mathbf{b} \end{gather*}The Lagrangian of this problem is
\[ \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = \mathbf{c}^T\mathbf{x} + \mathbf{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b})= (\mathbf{c}^T + \mathbf{\lambda}^T\mathbf{A})\mathbf{x} - \mathbf{\lambda}^T\mathbf{b}= (\mathbf{c} + \mathbf{A}^T\mathbf{\lambda})^T\mathbf{x} - \mathbf{\lambda}^T\mathbf{b} \]
Since the Lagrangian is linear in $\mathbf{x}$, its minimum over $\mathbf{x}$ is finite only if the coefficient of $\mathbf{x}$ vanishes, i.e.
\[ \mathbf{c} + \mathbf{A}^T\mathbf{\lambda} = 0, \]
in which case
\[ \mathcal{D}(\mathbf{\lambda}) = \min_{\mathbf{x} \in \mathbb{R}^2} \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = - \mathbf{\lambda}^T\mathbf{b} \]
so that the dual problem is given by
\begin{gather*} \max_{\mathbf{\lambda} \in \mathbb{R}^5} - \mathbf{b}^T\mathbf{\lambda}\\ \text{subject to } \mathbf{c} + \mathbf{A}^T\mathbf{\lambda} = 0 \text{ and } \mathbf{\lambda} \geq 0 \end{gather*}In terms of the original values of the parameters it can thus be written as
\[\max_{\mathbf{\lambda} \in \mathbb{R}^5} - \begin{bmatrix}33 \\ 8 \\ 5 \\ -1 \\ 8\end{bmatrix}^T\begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5\end{bmatrix}\\ \text{subject to } -\begin{bmatrix}5 \\ 3\end{bmatrix}+\begin{bmatrix}2 & 2 & -2 & 0 & 0\\2 & -4 & 1 & -1 & 1\end{bmatrix}\begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5 \end{bmatrix} = 0\\\text{and } \begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5\end{bmatrix} \geq 0\]
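A hedged numerical check of this primal/dual pair, using scipy.optimize.linprog (which solves minimization problems, so the dual maximum is recovered as minus the minimum of $\mathbf{b}^T\mathbf{\lambda}$):
\begin{verbatim}
# Solve the primal LP and its dual numerically and compare the optimal values.
import numpy as np
from scipy.optimize import linprog

c = np.array([-5.0, -3.0])
b = np.array([33.0, 8.0, 5.0, -1.0, 8.0])
A = np.array([[2, 2], [2, -4], [-2, 1], [0, -1], [0, 1]], dtype=float)

# Primal: min c^T x  subject to  A x <= b,  x unbounded
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual:  max -b^T lam  subject to  c + A^T lam = 0,  lam >= 0
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 5)

print(primal.fun, -dual.fun)  # the two values should coincide (strong duality)
\end{verbatim}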
We introduce $\mathbf{Q} = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}$, $\mathbf{c} = \begin{bmatrix} 5\\3\end{bmatrix}$, \[\mathbf{A} = \begin{bmatrix} 1 & 0\\-1 & 0\\ 0 & 1\\ 0 & -1 \end{bmatrix}\] and \[\mathbf{b}=\begin{bmatrix}1\\1\\1\\1\end{bmatrix}\]
Then the quadratic program takes the form
\begin{gather*} \min_{\mathbf{x} \in \mathbb{R}^2} \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x}\\ \text{subject to } \mathbf{A}\mathbf{x} \leq \mathbf{b} \end{gather*}The Lagrangian corresponding to this problem is
\[ \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x} + \mathbf{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b})= \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + (\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{x} - \mathbf{\lambda}^T\mathbf{b} \]
where $\mathbf{\lambda} = \begin{bmatrix}\lambda_1\\\lambda_2\\\lambda_3\\\lambda_4\end{bmatrix}$. We minimize the Lagrangian over $\mathbf{x}$ by setting its gradient to zero, which results in
\[ \mathbf{Q}\mathbf{x} + \mathbf{c}+\mathbf{A}^T\mathbf{\lambda} = 0 \Rightarrow \mathbf{x} = -\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda}) \]
Substituting this back into the Lagrangian we obtain
\[ \mathcal{D}(\mathbf{\lambda})= \min_{\mathbf{x} \in \mathbb{R}^2} \mathcal{L}(\mathbf{x},\mathbf{\lambda}) =-\frac{1}{2}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})- \mathbf{\lambda}^T\mathbf{b} \]
The dual problem is now
\[ \max_{\mathbf{\lambda} \in \mathbb{R}^4} -\frac{1}{2}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})- \mathbf{\lambda}^T\mathbf{b}\\ \text{subject to } \mathbf{\lambda} \geq 0, \] where the parameter vectors and matrices are defined above.
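A hedged numerical check of this quadratic program and of the dual above, using scipy.optimize.minimize; the box constraints $\mathbf{A}\mathbf{x} \leq \mathbf{b}$ here are simply $-1 \leq x_i \leq 1$:
\begin{verbatim}
# Solve the primal QP and maximize the dual numerically; the optimal values
# should coincide (convex QP with a strictly feasible point, e.g. x = 0).
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 1.0], [1.0, 4.0]])
c = np.array([5.0, 3.0])
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
b = np.ones(4)
Qinv = np.linalg.inv(Q)

primal = minimize(lambda x: 0.5 * x @ Q @ x + c @ x,
                  x0=np.zeros(2), bounds=[(-1, 1), (-1, 1)])

def neg_dual(lam):                     # -D(lam), to be minimized over lam >= 0
    v = c + A.T @ lam
    return 0.5 * v @ Qinv @ v + b @ lam

dual = minimize(neg_dual, x0=np.zeros(4), bounds=[(0, None)] * 4)

print(primal.fun, -dual.fun)           # the two values should coincide
\end{verbatim}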
The primal problem can be written in the standard form as
\begin{gather*} \min_{\mathbf{w}\in\mathbb{R}^D} \frac{1}{2}\mathbf{w}^T\mathbf{w}\\ \text{subject to } 1-\mathbf{x}^T\mathbf{w} \leq 0 \end{gather*}The Lagrangian is then
\[ \mathcal{L}(\mathbf{w},\lambda)= \frac{1}{2}\mathbf{w}^T\mathbf{w} + \lambda(1-\mathbf{x}^T\mathbf{w}) \]
Taking the gradient with respect to $\mathbf{w}$ and setting it to zero, we obtain the position of the minimum: \[ \mathbf{w} = \lambda \mathbf{x} \]
Thus, the dual Lagrangian is
\[\mathcal{D}(\lambda)= \min_{\mathbf{w}\in\mathbb{R}^D}\mathcal{L}(\mathbf{w},\lambda)=-\frac{\lambda^2}{2}\mathbf{x}^T\mathbf{x} +\lambda \]
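For completeness (this step is not part of the derivation above, and assumes $\mathbf{x} \neq \mathbf{0}$), maximizing the concave dual over $\lambda \geq 0$ gives
\[ \frac{d\mathcal{D}}{d\lambda} = 1 - \lambda\,\mathbf{x}^T\mathbf{x} = 0 \Rightarrow \lambda^* = \frac{1}{\mathbf{x}^T\mathbf{x}}, \qquad \mathcal{D}(\lambda^*) = \frac{1}{2\,\mathbf{x}^T\mathbf{x}}, \qquad \mathbf{w}^* = \lambda^*\mathbf{x} = \frac{\mathbf{x}}{\mathbf{x}^T\mathbf{x}}, \]
so the primal value $\frac{1}{2}\mathbf{w}^{*T}\mathbf{w}^* = \frac{1}{2\,\mathbf{x}^T\mathbf{x}}$ equals the dual value, i.e. there is no duality gap.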
Assuming the standard dot product, the convex conjugate is defined as
\[ f^*(\mathbf{s}) = \sup_{\mathbf{x}\in\mathbb{R}^D} \mathbf{s}^T\mathbf{x} - f(\mathbf{x}) \]
Since the function $f(\mathbf{x})$ is continuous, differentiable and convex, finding the supremum is equivalent to finding the maximum, which can be done by setting the following gradient to zero:
\[ \nabla_{\mathbf{x}}\left(\mathbf{s}^T\mathbf{x} - f(\mathbf{x})\right) = \mathbf{s}^T - \nabla_\mathbf{x}f(\mathbf{x})=0 \]
that is \[ s_d = \frac{\partial}{\partial x_d}f(\mathbf{x}) = \log x_d + 1 \Rightarrow x_d = e^{s_d - 1} \]
Then
\[ f^*(\mathbf{s})= \sum_{d=1}^D s_d x_d - \sum_{d=1}^D x_d\log x_d= \sum_{d=1}^D s_d e^{s_d-1} - \sum_{d=1}^D e^{s_d-1}(s_d - 1)= \sum_{d=1}^D e^{s_d-1} \]
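A one-dimensional numerical sanity check of this conjugate: the supremum of $sx - x\log x$ over $x > 0$ should equal $e^{s-1}$.
\begin{verbatim}
# Compare the numerically computed conjugate of x*log(x) with exp(s - 1).
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate(s):
    res = minimize_scalar(lambda x: -(s * x - x * np.log(x)),
                          bounds=(1e-9, 1e3), method="bounded")
    return -res.fun

for s in (-1.0, 0.0, 0.5, 2.0):
    print(s, conjugate(s), np.exp(s - 1))   # the last two columns should match
\end{verbatim}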
Without loss of generality we can assume that $\mathbf{A} = \mathbf{A}^T$ is symmetric (only the symmetric part of $\mathbf{A}$ contributes to $\mathbf{x}^T\mathbf{A}\mathbf{x}$), and then its inverse is symmetric as well. Setting the gradient to zero,
\[ \nabla_{\mathbf{x}}\left(\mathbf{s}^T\mathbf{x} - f(\mathbf{x})\right)= \mathbf{s}^T - \mathbf{x}^T\mathbf{A} - \mathbf{b}^T= 0, \]
that is
$\mathbf{s} = \mathbf{A}\mathbf{x} + \mathbf{b} \Leftrightarrow \mathbf{x} = \mathbf{A}^{-1}(\mathbf{s} - \mathbf{b})$
Therefore
\[ f^*(\mathbf{s}) = \frac{1}{2}(\mathbf{s} - \mathbf{b})^T\mathbf{A}^{-1}(\mathbf{s} - \mathbf{b}) - c \]
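A hedged numerical check of this result, assuming (as needed for the supremum to be attained) that $\mathbf{A}$ is symmetric positive definite; the matrices below are random illustrative values:
\begin{verbatim}
# Compare sup_x [ s^T x - f(x) ] with 0.5*(s-b)^T A^{-1} (s-b) - c.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)     # symmetric positive definite
b = rng.standard_normal(3)
c = 0.7
s = rng.standard_normal(3)

f = lambda x: 0.5 * x @ A @ x + b @ x + c
sup = -minimize(lambda x: -(s @ x - f(x)), x0=np.zeros(3)).fun
closed_form = 0.5 * (s - b) @ np.linalg.inv(A) @ (s - b) - c
print(sup, closed_form)         # should agree up to solver tolerance
\end{verbatim}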
By the definition of the convex conjugate:
\[L^*(\beta) = \sup_{\alpha \in \mathbb{R}} \beta\alpha - \max\{0, 1-\alpha\}=\sup_{\alpha \in \mathbb{R}}\begin{cases}\beta\alpha, \text{ if } \alpha > 1\\\beta\alpha -1 + \alpha, \text{ if } \alpha< 1 \end{cases}=\sup_{\alpha \in \mathbb{R}}\begin{cases}\beta\alpha, \text{ if } \alpha > 1\\(\beta+1)\alpha -1, \text{ if } \alpha< 1\end{cases}\]
We must here distinguish three cases:
- if $\beta>0$, then $\beta\alpha$ and $(\beta + 1)\alpha -1$ are both increasing with $\alpha$, and take their maximum values at the maximum possible value of $\alpha$, that is \[L^*(\beta) = \sup_{\alpha \in \mathbb{R}}\begin{cases}+\infty, \text{ if } \alpha > 1\\\beta, \text{ if } \alpha< 1\end{cases}=+\infty\]
- if $\beta < -1$, then $\beta\alpha$ and $(\beta + 1)\alpha -1$ are both decreasing with $\alpha$, and take their maximum values at the minimum possible value of $\alpha$, that is \[L^*(\beta) = \sup_{\alpha \in \mathbb{R}}\begin{cases}\beta, \text{ if } \alpha > 1\\+\infty, \text{ if } \alpha< 1\end{cases}=+\infty\]
- Finally, if $-1 \leq \beta \leq 0$, then $\beta\alpha$ is decreasing with $\alpha$, whereas $(\beta + 1)\alpha -1$ is growing, i.e.\[L^*(\beta) = \sup_{\alpha \in \mathbb{R}}\begin{cases}\beta, \text{ if } \alpha > 1\\\beta, \text{ if } \alpha< 1\end{cases}=\beta\]
We thus obtain
\[L^*(\beta) =\begin{cases} \beta \text{ if } -1\leq \beta \leq 0\\+\infty \text{ otherwise} \end{cases}\]
Let us now define the function $M(\beta) = L^*(\beta) + \frac{\gamma}{2}\beta^2$ and calculate its convex conjugate (we assume that $\gamma> 0$):
\[ M^*(\alpha)= \sup_{\beta \in \mathbb{R}} \alpha\beta - L^*(\beta) - \frac{\gamma}{2}\beta^2= \sup_{\beta \in [-1, 0]} \alpha\beta - L^*(\beta) - \frac{\gamma}{2}\beta^2= \sup_{\beta \in [-1, 0]} \alpha\beta - \beta - \frac{\gamma}{2}\beta^2, \]
where the second equality holds because outside the interval $[-1,0]$ we have $-L^*(\beta) = -\infty$.
In the interval $[-1,0]$ the parabola $\alpha\beta - \beta - \frac{\gamma}{2}\beta^2 = - \frac{\gamma}{2}\left(\beta-\frac{\alpha-1}{\gamma}\right)^2 +\frac{(1-\alpha)^2}{2\gamma}$ has a maximum at $\beta = \frac{\alpha-1}{\gamma}$ if this point lies inside the interval, and at one of the interval endpoints otherwise. More precisely,
\[M^*(\alpha)=\begin{cases} 0, \text{ if } \alpha > 1\\ \frac{(1-\alpha)^2}{2\gamma}, \text{ if } 1-\gamma\leq\alpha\leq 1\\ 1-\alpha-\frac{\gamma}{2}, \text{ if } \alpha < 1-\gamma \end{cases}\]
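A numerical sanity check of this piecewise expression (using the fact that the supremum is over $\beta \in [-1,0]$), with an illustrative value $\gamma = 0.5$:
\begin{verbatim}
# Compare the numerically computed M*(alpha) with the piecewise formula above.
import numpy as np
from scipy.optimize import minimize_scalar

gamma = 0.5

def M_star_numeric(alpha):
    res = minimize_scalar(lambda b: -((alpha - 1) * b - 0.5 * gamma * b**2),
                          bounds=(-1.0, 0.0), method="bounded")
    return -res.fun

def M_star_closed(alpha):
    if alpha > 1:
        return 0.0
    if alpha >= 1 - gamma:
        return (1 - alpha) ** 2 / (2 * gamma)
    return 1 - alpha - gamma / 2

for alpha in (-1.0, 0.2, 0.8, 1.5):
    print(alpha, M_star_numeric(alpha), M_star_closed(alpha))  # columns agree
\end{verbatim}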