Mathematics for Machine Learning
Credit for these solutions goes to https://github.com/ilmoi/MML-Book.
I have simply appropriated the content and fixed some minor typos; the intellectual property remains theirs.
Linear Algebra
Solution
We have to check the five properties listed in Definition 2.7. Firstly, let’s look at closure. Our operation should take two elements of \(\mathbb{R}\setminus\{-1\}\), and spit out a real number which isn’t \(-1\). Clearly we will get a real answer, so let’s suppose we can get \(-1\), and we’ll hopefully find a contradiction. Let \(a,b \in \mathbb{R}\setminus\{-1\}\) be such that \(-1 = ab + a+ b\). Then \(ab+a+b+1 = 0\), so \((a+1)(b+1) = 0\). Therefore, either \(a=-1\) or \(b=-1\), which is a contradiction. Now, we attack associativity. We use the fact that \(+\) is associative and commutative on \(\mathbb{R}\). We have
\begin{aligned} (a\star b) \star c & = (ab+a+b) \star c \\ & = (ab+a+b)c + (ab+a+b) + c \\ & = abc + ac + bc + ab + a + b + c \\ & = abc + ab + ac + a + bc + b + c \\ & = a(bc+b+c) + a + (bc+b+c) \\ & = a(b\star c)+a + (b\star c) \\ & = a\star (b\star c). \end{aligned}
Observe that \(a\star 0 = a \cdot 0 +a+0 = a\), and similarly \(0\star a = a\), so \(0\) is the neutral element.
Fourthly, we take \(x \star y = 0\) (the neutral element we found in the previous step). Solving for \(y\) in terms of \(x\), we find that \(y = \frac{-x}{x+1}\), which always exists, since \(x\neq -1\). Thus such a \(y\) is the inverse element for \(x \in \mathbb{R}\setminus\{-1\}\). This concludes the argument that \((\mathbb{R}\setminus\{-1\},\star)\) is a group.
Finally, if we observe that \(a\star b = b \star a\), by the commutativity of multiplication and addition of real numbers, then we can conclude that \((\mathbb{R}\setminus\{-1\},\star)\) is an \(abelian\) group.
Solution
Let’s compute the left hand side as \(3\star (x \star x)\), say. We have \(3 \star (x^2+2x)\), and thus \(3(x^2+2x)+(x^2+2x)+3 = 15\). From this, we obtain \(4x^2+8x-12 = 0\), so \(x=1\) or \(x=-3\). You can check that both of these do indeed work in the original equation.
Solution
The major thing we need to check here is whether the operation \(\oplus\) is well-defined. That is to say, if we take two different representatives of the same congruence classes, do we definitely get the same answer at the end? Let’s take \(a_1, a_2 \in \mathbb{Z}\) such that \(\overline{a_1} = \overline{a_2}\), and similarly let \(b_1, b_2 \in \mathbb{Z}\) such that \(\overline{b_1} = \overline{b_2}\). We need to show that \(\overline{a_1}\oplus \overline{b_1} = \overline{a_2} \oplus \overline{b_2}\). In other words, we need to show that \(\overline{a_1+b_1} = \overline{a_2+b_2}\). By the definition of the congruence class, two classes \(\overline{a_1}\) and \(\overline{a_2}\) are equal in \(\mathbb{Z}_n\) if and only if \(a_1 - a_2\) is a multiple of \(n\). Let’s say \(a_1 - a_2 = k_1 n\) and \({b_1 - b_2 = k_2 n}\), for some \(k_1, k_2 \in \mathbb{Z}\). Observe then that \((a_1+b_1)-(a_2+b_2) = (k_1+k_2)n\), and so \(\overline{a_1+b_1} = \overline{a_2+b_2}\) indeed!
From here, we use properties of addition in the integers to deduce that \((\mathbb{Z}_n,\oplus)\) is an abelian group.
\(\overline{a}\oplus\overline{b} = \overline{a+b}\in \mathbb{Z}_n\)
\((\overline{a}\oplus \overline{b}) \oplus \overline{c} = \overline{a+b} \oplus \overline{c} = \overline{(a+b)+c} = \overline{a+(b+c)} = \overline{a} \oplus \overline{b+c} = \overline{a}\oplus (\overline{b} \oplus \overline{c})\), where the middle equality follows from the associativity of \(\mathbb{Z}\).
\(\overline{0}\) is the neutral element, since \(\overline{0} \oplus \overline{a} = \overline{0+a} = \overline{a}\), and similarly the other way round.
Given \(\overline{a} \in \mathbb{Z}_n\), we have that \(\overline{-a} = \overline {n-a}\) is the inverse element, since \(\overline{a} \oplus \overline{-a} = \overline{a-a} = \overline{0}\), indeed.
Finally, \(\overline{a}\oplus \overline{b} = \overline{a+b} = \overline{b+a} = \overline{b} \oplus \overline{a}\), so the group is abelian!
Solution
Observe that the neutral element is \(\overline{1}\). For the inverses, we have \(\overline{1}^{-1} = \overline{1}\), \(\overline{2}^{-1} = \overline{3}\), \(\overline{3}^{-1} = \overline{2}\), and \(\overline{4}^{-1} = \overline{4}\). All the axioms are satisfied, so \((\mathbb{Z}_5\setminus\{\overline{0}\} , \otimes)\) is a group.
Solution
Observe that, for example, \(\overline{2}\otimes \overline{4} = \overline{0}\notin \mathbb{Z}_8\setminus\{\overline{0}\}\), so the operation is not closed.
Solution
If \(n\) is prime, then every integer from \(1\) to \(n-1\) will be relatively prime to \(n\). Let \(a\) be such an integer. Then, by Bézout’s theorem, there exist integers \(u, v\) such that \(au+nv=1\). If we take this identity mod \(n\), we have \(\overline{au} \oplus \overline{nv} = \overline{1}\). But \(nv\) is a multiple of \(n\), so \(\overline{nv} = \overline{0}\), and this simplifies to \(\overline{a} \otimes \overline{u} = \overline{1}\) – in other words, \(\overline{u}\) is the inverse of \(\overline{a}\). Also, if we take two integers \(a\) and \(b\), both of which are relatively prime to \(n\), then the product \(ab\) also cannot contain a factor of \(n\), since \(n\) is prime, so the operation is closed. The other properties follow immediately. This shows that we have a group. \(\lozenge\) If \(n\) is not prime, then we can say \(n=ab\), where \(a,b\in \mathbb{Z}\), with \(1<a\leq b<n\). But then \(\overline{a}\otimes \overline{b} = \overline{ab} = \overline{n} = \overline{0}\notin \mathbb{Z}_n\setminus\{\overline{0}\}\), and so the operation is not closed.
Let \(A_1 = \begin{bmatrix} 1 & x_1 & y_1 \\ 0 & 1 & z_1 \\ 0 & 0 & 1 \end{bmatrix}\) and \(A_2 = \begin{bmatrix} 1 & x_2 & y_2 \\ 0 & 1 & z_2 \\ 0 & 0 & 1 \end{bmatrix}\in \mathcal{G}\).
Then \(A_1 A_2 =\begin{bmatrix} 1 & x_1+x_2 & y_1+x_1z_2+y_2 \\ 0 & 1 & z_1+z_2 \\ 0 & 0 & 1 \end{bmatrix}\in\mathcal{G}\), so we have closure.
Associativity follows from the associativity of standard matrix multiplication.
Letting \(x=y=z=0\), observe that the identity is in \(\mathcal{G}\).
Finally, if we take \(x_2 = -x_1\), \(z_2 = -z_1\), and \(y_2 = -y_1-x_1z_2\), then observe that \(A_1A_2 = I_3\), and thus inverses are of the required form! Therefore, \(\mathcal{G}\) is a group.
The group is not abelian, e.g. take \(x_1=z_2=1\) and everything else to be \(0\). Then multiplying these matrices in the other order (i.e. \(x_2=z_1=1\)) gives a different answer.
Solution
The number of columns in the first matrix must equal the number of rows in the second, so this product is not defined.
Solution
\[\begin{bmatrix} 4&3&5\\ 10&9&11\\ 16&15&17 \end{bmatrix}\]
Solution
\[\begin{bmatrix} 5&7&9\\ 11&13&15\\ 8&10&12 \end{bmatrix}\]
Solution
\[\begin{bmatrix} 14&6\\ -21&2 \end{bmatrix}\]
Solution
\[\begin{bmatrix} 12&3&-3&-12\\ -3&1&2&6\\ 6&5&1&0\\ 13&12&3&2 \end{bmatrix}\]
Solution
Let’s do some Gaussian elimination. We start with \[\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 2&5&-7&-5&-2\\ 2&-1&1&3&4\\ 5&2&-4&2&6 \end{array}\right]\].
Taking \(r_2-2r_1\), \(r_3-2r_1\) and \(r_4-5r_1\), we have \[\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&-3&3&5&2\\ 0&-3&1&7&1 \end{array}\right]\].
Now taking \(r_3+r_2\) and \(r_4+r_2\), we have \[\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&0&-2&2&-2\\ 0&0&-4&4&-3 \end{array}\right]\].
Finally, taking \(r_4-2r_3\), we have \[\left[ \begin{array}{cccc|c} 1&1&-1&-1&1\\ 0&3&-5&-3&-4\\ 0&0&-2&2&-2\\ 0&0&0&0&1 \end{array}\right]\].
Observe that the rank of the augmented matrix (4) is greater than the rank of the coefficient matrix (3), so the system has no solution.
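As a quick numerical check (a NumPy sketch; the coefficient matrix and right-hand side are transcribed from the augmented matrix above), the two ranks can be compared directly:

```python
import numpy as np

# Coefficient matrix and right-hand side of the original system.
A = np.array([[1, 1, -1, -1],
              [2, 5, -7, -5],
              [2, -1, 1, 3],
              [5, 2, -4, 2]])
b = np.array([1, -2, 4, 6])

rank_A = np.linalg.matrix_rank(A)                         # 3
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))  # 4
print(rank_A, rank_Ab)  # different ranks => no solution
```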
Solution
We proceed with some more Gaussian elimination. From the start, we do \(r_2-r_1\), \(r_3-2r_1\), and \(r_4+r_1\), to obtain \[\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&2&0&-3&-1&3\\ 0&1&0&1&-3&-1\\ 0&1&0&-2&0&2 \end{array}\right]\].
From here, do \(r_2-2r_3\) and \(r_4-r_3\) to obtain \[\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&0&0&-5&5&5\\ 0&1&0&1&-3&-1\\ 0&0&0&-3&3&3 \end{array}\right]\].
Next, if we divide \(r_2\) by \(-5\), and then do \(r_4+3r_2\), we have \[\left[ \begin{array}{ccccc|c} 1&-1&0&0&1&3\\ 0&0&0&1&-1&-1\\ 0&1&0&1&-3&-1\\ 0&0&0&0&0&0 \end{array}\right]\].
Now, we do \(r_1+r_3\), then swap \(r_2\) and \(r_3\), to give \[\left[ \begin{array}{ccccc|c} 1&0&0&1&-2&2\\ 0&1&0&1&-3&-1\\ 0&0&0&1&-1&-1\\ 0&0&0&0&0&0 \end{array}\right]\].
Finally, let’s do \(r_1-r_3\) and \(r_2-r_3\), to obtain \[\left[ \begin{array}{ccccc|c} 1&0&0&0&-1&3\\ 0&1&0&0&-2&0\\ 0&0&0&1&-1&-1\\ 0&0&0&0&0&0 \end{array}\right]\].
Let’s turn these back into equations. We have \(x_1-x_5=3\), \(x_2-2x_5=0\) and \(x_4-x_5=-1\). Thus we can take \(x_3\) and \(x_5\) to be arbitrary, and then the others are determined. This gives a solution set of \(\{ (\alpha+3, 2\alpha, \beta, \alpha-1, \alpha)^{\mathsf{T}} : \alpha, \beta \in \mathbb{R} \}\).
We start with doing \(r_3-r_1\), to give us \[\left[ \begin{array}{cccccc|c} 0&1&0&0&1&0&2\\ 0&0&0&1&1&0&-1\\ 0&0&0&0&-1&1&-1 \end{array}\right]\].
Next, do \(r_1+r_3\), \(r_2+r_3\), and then multiply \(r_3\) by \(-1\). This gives \[\left[ \begin{array}{cccccc|c} 0&1&0&0&0&1&1\\ 0&0&0&1&0&1&-2\\ 0&0&0&0&1&-1&1 \end{array}\right]\].
This corresponds to the equations \(x_2+x_6=1\), \(x_4+x_6=-2\), and \(x_5-x_6 = 1\). Now we can take \(x_1\), \(x_3\) and \(x_6\) to be arbitrary, giving a solution set of \(\{ (\alpha, 1-\beta, \gamma, -2-\beta, 1+\beta, \beta)^{\mathsf{T}} : \alpha,\beta, \gamma \in \mathbb{R} \}\).
Here, we are looking for eigenvectors associated with the eigenvalue \(12\). Indeed, we don’t yet know that \(12\) is an eigenvalue, but if we find ourselves unable to find such an eigenvector, then we can answer that the solution set will be empty. We have that \(A-12I = \begin{bmatrix}-6&4&3\\ 6&-12&9\\ 0&8&-12 \end{bmatrix}\), and observe that \((3,3,2)^{\mathsf{T}} \in \ker(A-12I)\), so this is an eigenvector (and \(12\) is an eigenvalue indeed!). Finally, we want an eigenvector such that the sum of the components is \(1\), so simply divide the eigenvector above by \(3+3+2=8\).
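A short numerical check (a sketch that reconstructs \(A\) from the matrix \(A-12I\) written above, which is an assumption about the original matrix):

```python
import numpy as np

# A reconstructed by adding 12*I back onto the A - 12I shown above.
A = np.array([[6., 4., 3.],
              [6., 0., 9.],
              [0., 8., 0.]])
v = np.array([3., 3., 2.])

print(A @ v)        # [36. 36. 24.] = 12 * v, so 12 is indeed an eigenvalue
print(v / v.sum())  # the eigenvector rescaled so its components sum to 1
```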
Solution
Observe that \(\det A = 2(24-25) -3(18-20) + 4(15-16) = -2+6-4=0\), so \(A\) is not invertible.
Solution
We perform Gaussian elimination on \[\left[ \begin{array}{cccc|cccc} 1&&1&&1&&&\\ &1&1&&&1&&\\ 1&1&&1&&&1&\\ 1&1&1&&&&&1 \end{array}\right]\], where a blank space denotes a 0.
Firstly, with \(r_3-r_1\), \(r_4-r_1\), \(r_3-r_2\), and \(r_4-r_2\), we have \[\left[ \begin{array}{cccc|cccc} 1&&1&&1&&&\\ &1&1&&&1&&\\ &&-2&1&-1&-1&1&\\ &&-1&&-1&-1&&1 \end{array}\right]\].
Then with \(r_1+r_4\), \(r_2+r_4\), and \(r_3-2r_4\), we have \[\left[ \begin{array}{cccc|cccc} 1&&&&&-1&&1\\ &1&&&-1&&&1\\ &&&1&1&1&1&-2\\ &&-1&&-1&-1&&1 \end{array}\right]\].
Finally, swapping \(r_3\) and \(r_4\), then multiplying \(r_3\) by \(-1\), we have \[\left[ \begin{array}{cccc|cccc} 1&&&&&-1&&1\\ &1&&&-1&&&1\\ &&1&&1&1&&-1\\ &&&1&1&1&1&-2 \end{array}\right]\].
The matrix to the right of the vertical line is the inverse of \(A\).
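We can verify the result numerically; a minimal NumPy check, with \(A\) taken from the left block and the claimed inverse from the right block above:

```python
import numpy as np

A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [1, 1, 1, 0]])
A_inv = np.array([[ 0, -1,  0,  1],
                  [-1,  0,  0,  1],
                  [ 1,  1,  0, -1],
                  [ 1,  1,  1, -2]])

print(A @ A_inv)                          # the 4x4 identity matrix
print(np.allclose(A @ A_inv, np.eye(4)))  # True
```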
Solution
We can relabel \(\mu^3\) as \(\nu\), so \(\nu\) can be any real number, and then we have \(A = \{ (\lambda, \lambda+\nu, \lambda-\nu)^{\mathsf{T}} : \lambda, \nu \in \mathbb{R} \}\). This has a basis of \(\{(1,1,1)^{\mathsf{T}}, (0,1,-1)^{\mathsf{T}}\}\), so it is a subspace of \(\mathbb{R}^3\).
Solution
We cannot do the same trick as before, since the square of a real number is always at least zero. Clearly \((1,-1,0)^{\mathsf{T}}\in B\), but \(-1\) times this vector, i.e. \((-1,1,0)^{\mathsf{T}}\notin B\), and thus \(B\) is not a subspace.
Solution
We know that \((0,0,0)^{\mathsf{T}}\) is an element of every subspace of \(\mathbb{R}^3\), so \(C\) can only be a subspace if \(\gamma = 0\). In this case, we can find a basis for \(C\) (say \(\{ (3,0,-1)^{\mathsf{T}},(0,3,2)^{\mathsf{T}} \}\)), and conclude that it is indeed a subspace.
Solution
Again, this is not a subspace. Observe that \((0,1,0)^{\mathsf{T}}\in D\), so if \(D\) were a subspace, then any (real!) multiple should be in \(D\) also. However, \(\frac12 (0,1,0)^{\mathsf{T}} \notin D\).
Solution
If we form a matrix out of these three vectors and compute its determinant, we get zero. Thus, the vectors are not linearly independent.
Solution
Let’s form the equation \(\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 = 0\). These three vectors are linearly independent if and only if the \(only\) solution to this equation is \(\alpha_1=\alpha_2=\alpha_3=0\). Looking at the third component, we have that \(\alpha_1\cdot 1 + \alpha_2 \cdot 0 + \alpha_3 \cdot 0 = 0\), that is to say, \(\alpha_1 = 0\).
Next, look at the second component. We already know \(\alpha_1 = 0\), so we have \(\alpha_2 \cdot 1 + \alpha_3 \cdot 0 = 0\), that is to say \(\alpha_2=0\) also.
Finally, look at the first component. We have that \(\alpha_3 \cdot 1 = 0\), so all of the \(\alpha_i\)’s are zero. Therefore, our three vectors are linearly independent.
Here, we need to solve the simultaneous equations \(\alpha_1+\alpha_2+2\alpha_3 = 1\), \(\alpha_1+2\alpha_2-\alpha_3=-2\), and \(\alpha_1+3\alpha_2+\alpha_3 = 5\). Performing Gaussian elimination in the usual way, we determine that \(\alpha_1 = -6\), \(\alpha_2 = 3\), and \(\alpha_3 = 2\). That is to say, \(y=-6x_1+3x_2+2x_3\).
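A small NumPy sketch confirming the coefficients, with the matrix assembled from the three simultaneous equations above:

```python
import numpy as np

# Each row holds the coefficients of (alpha_1, alpha_2, alpha_3) in one equation;
# rhs collects the constants on the right-hand sides.
M = np.array([[1, 1,  2],
              [1, 2, -1],
              [1, 3,  1]])
rhs = np.array([1, -2, 5])

alpha = np.linalg.solve(M, rhs)
print(alpha)   # [-6.  3.  2.]
```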
We write the given vectors as \(v_1, \dots v_6\) from left to right. Firstly, observe that \(\dim(U_1)=2\) and \(\dim(U_2)=2\) (compute the rank of \(\left[v_1|v_2|v_3\right]\), then \(\left[v_4|v_5|v_6\right]\)). Since we can write \(v_3= \frac13 (v_1-2v_2)\) and \(v_6=-v_4-2v_5\), we need not consider \(v_3\) and \(v_6\) any further. Now, if we find the rank of \(\left[v_1|v_2|v_4|v_5\right]\), we get 3, so \(\dim(U_1+U_2)=3\). Therefore, \(\dim(U_1\cap U_2) = 2+2-3=1\). Hence, to find a basis of \(U_1 \cap U_2\), we need only find any non-zero vector in the space.
Let \(0\neq v \in U_1\cap U_2\). Then we can write \(v=\alpha_1v_1 + \alpha_2v_2\), and \(v=\alpha_4v_4 + \alpha_5v_5\). Subtracting these equations, we have \(0=\alpha_1v_1 + \alpha_2v_2 -\alpha_4v_4 - \alpha_5v_5\). Remember we want a non-zero solution for \(v\), and observe that the rank of \(\left[v_1|v_2|v_4\right]\) is 3 (i.e. these three vectors are linearly independent). Hence we can take \(\alpha_5=9\), say (this means we don’t have fractions later, but there’s no way to know this a priori!), and solve for the other \(\alpha_i\)’s. Using Gaussian elimination, we obtain \(\alpha_1=4\), \(\alpha_2=10\) and \(\alpha_4 = -6\). Thus \(\begin{equation*} v=4v_1+10v_2 = -6v_4+9v_5 = \begin{bmatrix} 24\\ -6\\ -12\\ -6 \end{bmatrix}. \end{equation*}\)
Finally, we can write our basis of \(U_1\cap U_2\) as just \(\{ v \}\).
Solution
We can compute \(rank({A_1})=2\) and \(rank({A_2})=2\) also. Since both matrices map from \(\mathbb{R}^3\) to \(\mathbb{R}^4\), we must have that the nullity of both of the matrices is 1. Therefore, \(U_1\) and \(U_2\) both have dimension 1, since they \(are\) the kernels of their respective maps.
Solution
Since the spaces have dimension 1, we are again simply looking for a non-zero vector in each space. Observe that \((1,1,-1)^{\mathsf{T}}\) lies in both spaces, so \(\{ (1,1,-1)^{\mathsf{T}} \}\) is a basis for both.
Solution
From the previous part, we have that \(U_1=U_2\), so \(U_1\cap U_2 = U_1\) also, and again has \(\{ (1,1,-1)^{\mathsf{T}} \}\) as a basis.
Solution
Observe that these matrices are the same as those used in the previous parts, except now our spaces \(U_1\) and \(U_2\) are different. We now have \(\dim (U_1)=rank(A_1)=2\) and \(\dim(U_2) = rank (A_2)=2\).
Solution
We are looking for two linearly independent columns in each of our matrices – we need two non-zero columns, and they can’t be a multiple of each other. For example, the first two columns of each matrix will do as a basis for each space.
Solution
Note that \(rank([A_1|A_2])=3\), i.e. \(\dim(U_1+U_2)=3\) (here \([A_1|A_2]\) is just the \(4\times 6\) matrix formed by concatenating \(A_1\) and \(A_2\)). This means that \(\dim (U_1\cap U_2) = \dim(U_1)+\dim(U_2)-\dim(U_1+U_2) = 2+2-3=1\), so again, to find a basis of \(U_1\cap U_2\), we need only find a non-zero vector in the space. We proceed in a similar way to Question 2.12. Firstly, we observe that we can write \(v_3=v_1+v_2\) and \(v_6=v_4+v_5\), so these two vectors can safely be ignored. Secondly, observe that \(\left[v_1|v_2|v_5\right]\) has rank 3, so (using the notation of Question 2.12) if we take \(\alpha_4=1\), say, and solve, then we have \(\alpha_1=3\), \(\alpha_2=1\), and \(\alpha_5=0\). In other words, our non-zero vector is \(3v_1+v_2 = v_4 = (3,1,7,3)^{\mathsf{T}}\), and our basis of \(U_1\cap U_2\) is \(\{ (3,1,7,3)^{\mathsf{T}} \}\).
Solution
Firstly, observe that \((0,0,0) \in F\), and \((0,0,0)\in G\) also. Next, we check that adding any two elements of the set does indeed get us another element of the set. Finally, if we multiply any element of the set by any real number, we again get another element of the required form. Thus \(F\) and \(G\) are subspaces indeed!
Solution
A vector in \(F\cap G\) will satisfy both conditions in the sets, so if we put \(G\)’s condition into \(F\)’s, we find \((a-b)+(a+b)-(a-3b) = 0\), from which we have \(a=-3b\). Thus, \(F\cap G = \{ (-4b,-2b,-6b) : b\in \mathbb{R} \} = span[ (2,1,3)]\).
Solution
Doing the same dimensional analysis as the previous three questions, we find that \(F\cap G\) has dimension 1. We have \(F=span[(1,0,1), (0,1,1)]\), and \(G=span[ (1,1,1),(-1,1,-3)]\). Proceeding in the same way as before, we find that \(-4v_1-2v_2 = -3v_3+v_4\), and hence \(\{(-4,-2,-6)\}\) is a basis of \(F \cap G\), which agrees with Part b.
Solution
Observe that \(\Phi(f+g) = \int_a^b (f+g)(x)\, dx = \int_a^b \left(f(x) + g(x)\right) dx = \int_a^bf(x)\, dx +\int_a^b g(x)\, dx = \Phi(f) + \Phi(g)\), and similarly \(\Phi(\alpha f) = \alpha \Phi(f)\), for all real \(\alpha\), so \(\Phi\) (that is to say, definite integration) is indeed linear!
Solution
Similarly to the previous part, we know that differentiation is indeed linear.
Solution
This is not linear – \(\Phi\) doesn’t even map 0 to 0, indeed!
Solution
We know from 2.7.1 that any matrix transformation like this is indeed linear. This comes from distributive properties of matrix multiplication.
Solution
As before, this mapping is also linear. Indeed, this represents a clockwise rotation by \(\theta\) about the origin. (See 3.9.1)
From the coefficients on the right, we have \(A_\Phi = \begin{bmatrix} 3&2&1\\ 1&1&1\\ 1&-3&0\\ 2&3&1 \end{bmatrix}\).
Using Gaussian elimination, we can compute that \(rank(A_\Phi) = 3\). From this we deduce that the kernel is trivial (i.e. only \((0,0,0)\)), and clearly \(Im(\Phi)= \{ (3x_1+2x_2+x_3,x_1+x_2+x_3,x_1-3x_2,2x_1+3x_2+x_3)^{\mathsf{T}} : x_1, x_2, x_3 \in \mathbb{R} \}\). We have \(\dim(\ker(\Phi))=0\), and \(\dim(Im(\Phi))=rank(A_\Phi)=3\).
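A quick rank computation with NumPy, using the matrix \(A_\Phi\) written above:

```python
import numpy as np

A_Phi = np.array([[3,  2, 1],
                  [1,  1, 1],
                  [1, -3, 0],
                  [2,  3, 1]])

print(np.linalg.matrix_rank(A_Phi))   # 3, so the kernel is trivial
```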
We have two automorphisms, which means they map linearly and bijectively from the space, \(E\), to itself. The maps therefore both have kernel \(\{0\}\) (by injectivity) and image \(E\) (by surjectivity). From this, we immediately deduce that \(\ker(f)\cap Im(g) = \{0\}\), indeed. Similarly, we deduce that \(g\circ f\) also has kernel \(\{0\}\) and image \(E\), as required. Note that we didn’t need the condition that \(f\circ g = id_E\).
Solution
Note that \(rank(A_\Phi) = 3\), so \(\ker(\Phi)=\{0\}\) and \(Im(\Phi)=\mathbb{R}^3\).
Solution
Let \(P\) be the change of basis matrix from \(B\) to the standard basis of \(\mathbb{R}^3\), i.e. the matrix whose columns are the vectors of \(B\) expressed in the standard basis. Then \(P=\begin{bmatrix} 1&1&1\\ 1&2&0\\ 1&1&0 \end{bmatrix}\).
The matrix \(\overline{A_\Phi}\) is given by \(P^{-1}A_\Phi P = \begin{bmatrix} 6&9&1\\ -3&-5&0\\ -1&-1&0 \end{bmatrix}\).
Solution
Each set \(B\) and \(B’\) has the correct number of (clearly!) linearly independent vectors, so they are both bases of \(\mathbb{R}^2\).
Solution
We write the old basis vectors (\(B’\)) in terms of the new (\(B\)), and then transpose the matrix of coefficients. We have \(b_1’ = 4 b_1 + 6 b_2\), and \(b_2’ = 0b_1 -b_2\). Thus \(P_1 = \begin{bmatrix} 4 & 0\\ 6 & -1 \end{bmatrix}\).
Solution
Let \(M=[c_1|c_2|c_3]\), and observe that \(\det M = 4 \neq 0\), so the vectors are linearly independent. Since \(\mathbb{R}^3\) has dimension 3, and we have three linearly independent vectors, \(C\) must indeed be a basis. Indeed, such an \(M\) is the change of basis matrix from \(C\) to \(C’\) (write the old vectors in terms of the new!) and this is thus the \(P_2\) we require. Thus \(P_2 = \begin{bmatrix} 1&0&1\\ 2&-1&0\\ -1&2&-1 \end{bmatrix}\).
Solution
Observe that by adding the given results, we find that \(\Phi(b_1) = c_1 + 2c_3\); by subtracting, we have \(\Phi(b_2) = -c_1 +c_2 -c_3\). Then \(A_\Phi\) is given by the transpose of the matrix of coefficients, so \(A_\Phi = \begin{bmatrix} 1&-1\\ 0&1\\ 2&-1 \end{bmatrix}\).
Solution
We first need to apply \(P_1\) to change from basis \(B’\) to \(B\). Then \(A_\Phi\) will map us to \((\mathbb{R}^3, C)\), before \(P_2\) will take us to \(C’\). Remember that matrices are acting like functions here, so they are applied to (column) vectors from right to left. Therefore the multiplication we require is \(A’=P_2 A_\Phi P_1\). (This is what part f is asking us to recognise.) We have \(A’ = \begin{bmatrix} 0&2\\ -10&3\\ 12&-4 \end{bmatrix}\).
Solution
\(P_1 \begin{bmatrix}2\\3\end{bmatrix}=\begin{bmatrix}8\\9\end{bmatrix}\). \(A_\Phi \begin{bmatrix}8\\9\end{bmatrix}=\begin{bmatrix}-1\\9\\7\end{bmatrix}\).
\(P_2 \begin{bmatrix}-1\\9\\7\end{bmatrix}=\begin{bmatrix}6\\ -11\\12\end{bmatrix}\).
And observe that \(A’ \begin{bmatrix}2\\3\end{bmatrix}=\begin{bmatrix}6\\ -11\\12\end{bmatrix}\), indeed!
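All of the matrices involved appear above, so the whole chain can be checked numerically; a short NumPy sketch:

```python
import numpy as np

P1 = np.array([[4,  0],
               [6, -1]])
A_Phi = np.array([[1, -1],
                  [0,  1],
                  [2, -1]])
P2 = np.array([[ 1,  0,  1],
               [ 2, -1,  0],
               [-1,  2, -1]])

A_prime = P2 @ A_Phi @ P1
print(A_prime)                      # [[0, 2], [-10, 3], [12, -4]]

x = np.array([2, 3])
print(P2 @ (A_Phi @ (P1 @ x)))      # [6, -11, 12], step by step
print(A_prime @ x)                  # [6, -11, 12], in one go
```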
Analytic Geometry
We must show that \(\langle\cdot,\cdot\rangle\) is a bilinear, symmetric, positive definite map. It is a calculation to show that \(\langle \alpha x + \beta x’, y \rangle = \alpha \langle x,y\rangle + \beta \langle x’,y\rangle\), and similarly for the second argument, and so \(\langle\cdot,\cdot\rangle\) is bilinear.
We have \(\langle y,x\rangle = y_1x_1 - (y_1x_2+y_2x_1)+2y_2x_2 = \langle x,y\rangle\) indeed, so we have symmetry.
Finally, \(\langle x,x \rangle = x_1^2 - 2x_1x_2+2x_2^2 = (x_1-x_2)^2 + x_2^2\). Clearly, if \(x=0\), then \(\langle x,x \rangle = 0\). If \(x\neq 0\), then either \(x_2 \neq 0\), in which case \(\langle x,x \rangle > 0\); or \(x_2=0\) with \(x_1 \neq 0\), so \(\langle x,x \rangle > 0\) again (since it’s the sum of two squares of real numbers). Therefore \(\langle \cdot , \cdot \rangle\) is an inner product indeed!
Observe that \(\left\langle \begin{bmatrix} 1\\0 \end{bmatrix} , \begin{bmatrix} 0\\1 \end{bmatrix} \right\rangle = 0\), but \(\left\langle \begin{bmatrix} 0\\1 \end{bmatrix} , \begin{bmatrix} 1\\0 \end{bmatrix} \right\rangle = 1\), so \(\langle \cdot, \cdot \rangle\) is not symmetric, and thus not an inner product. (Observe that the matrix \(A\) is not a symmetric matrix!)
We have \(d(x,y) = \sqrt{\langle(x-y),(x-y)\rangle} = \sqrt{\left\langle\begin{bmatrix} 2\\3\\3 \end{bmatrix},\begin{bmatrix} 2\\3\\3 \end{bmatrix}\right\rangle}\).
Solution
Thus \(d(x,y) = \sqrt {22}\).
Solution
Here, \(d(x,y) = \sqrt{47}\).
We have that the angle \(\omega\) between the two vectors is given by \(\omega = \cos^{-1}\left(\frac{\langle x, y \rangle}{\lVert x\rVert\lVert y\rVert}\right)\), where \(\lVert x \rVert = \sqrt{\langle x,x\rangle}\).
Solution
Here, \(\lVert x \rVert = \sqrt 5\) and \(\lVert y \rVert = \sqrt 2\). Also, \(\langle x,y \rangle = -3\). Thus \(\omega = \cos^{-1}\left(\frac{-3}{\sqrt{10}}\right) \approx 2.8198\) radians.
Solution
Now, \(\lVert x \rVert = \sqrt {18}\) and \(\lVert y \rVert = \sqrt 7\). Also, \(\langle x,y \rangle = -11\). Thus \(\omega = \cos^{-1}\left(\frac{-11}{\sqrt{126}}\right) = \cos^{-1}\left(\frac{-11\sqrt{14}}{42}\right)\).
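Both angles can be evaluated numerically from the norms and inner products computed above; a tiny NumPy check:

```python
import numpy as np

# Part (a): ||x|| = sqrt(5), ||y|| = sqrt(2), <x, y> = -3
print(np.arccos(-3 / np.sqrt(5 * 2)))     # 2.8198... rad

# Part (b): ||x|| = sqrt(18), ||y|| = sqrt(7), <x, y> = -11
print(np.arccos(-11 / np.sqrt(18 * 7)))   # about 2.94 rad
```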
Solution
Let \(v_1, \dots, v_4\) be the four vectors defined in the question. Observe that \(rank[v_1|v_2|v_3|v_4]=3\). Moreover, \(rank[v_1|v_2|v_3]=3\), so these three vectors form a basis of \(U\). Let \(B\) be this matrix of basis vectors, i.e. \(B=[v_1|v_2|v_3]\). Now we compute \(B^{\mathsf{T}}B = \begin{bmatrix} 9&9&0\\ 9&16&-14\\ 0&-14&31 \end{bmatrix}\) and \(B^{\mathsf{T}}x = \begin{bmatrix} 9\\23\\ -25 \end{bmatrix}\).
Next, we solve \(B^{\mathsf{T}}B\lambda = B^{\mathsf{T}}x\) for \(\lambda\). Using Gaussian elimination, we obtain \(\lambda = \begin{bmatrix}-3\\ 4\\ 1 \end{bmatrix}\).
Finally, \(\pi_U(x)\) is given by \(B\lambda = \begin{bmatrix} 1\\ -5\\ -1\\ -2\\3 \end{bmatrix}\).
Solution
By construction, \(d(x,U) = d(x,\pi_U(x))\). This is given by \(\lVert x-\pi_U(x) \rVert = \begin{Vmatrix}-2\\ -4\\0\\6\\ -2 \end{Vmatrix}\).
This norm is given by the square root of the dot product of the vector with itself, so \(d(x,U) = \sqrt{60}=2\sqrt{15}\).
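The normal equations can also be solved numerically as a sanity check (a NumPy sketch using the \(B^{\mathsf{T}}B\) and \(B^{\mathsf{T}}x\) computed above):

```python
import numpy as np

BtB = np.array([[ 9,   9,   0],
                [ 9,  16, -14],
                [ 0, -14,  31]])
Btx = np.array([9, 23, -25])

lam = np.linalg.solve(BtB, Btx)
print(lam)                            # [-3.  4.  1.]
print(np.sqrt(60), 2 * np.sqrt(15))   # the same distance, ~7.746
```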
Solution
We require \(\langle e_1, e_2-\pi_U(e_2)\rangle = 0\) and \(\langle e_3, e_2-\pi_U(e_2)\rangle = 0\). That is to say, \(\left\langle \begin{bmatrix} 1\\0\\0 \end{bmatrix}, \begin{bmatrix}-\lambda_1\\1\\ -\lambda_2 \end{bmatrix} \right\rangle = 0\), and \(\left\langle \begin{bmatrix} 0\\0\\1 \end{bmatrix}, \begin{bmatrix}-\lambda_1\\1\\ -\lambda_2 \end{bmatrix} \right\rangle = 0\).
Computing the first gives us \(\lambda_1=\frac12\), while the second gives \(\lambda_2 = -\frac12\).
Therefore, \(\pi_U(e_2) = \begin{bmatrix} \frac12 \\ 0\\ -\frac12 \end{bmatrix}\).
Solution
We have \(d(e_2,U) = \lVert e_2-\pi_U(e_2) \rVert = \sqrt{\left\langle \begin{bmatrix}-\frac12 \\ 1\\ \frac12\end{bmatrix},\begin{bmatrix}-\frac12 \\ 1\\ \frac12\end{bmatrix}\right\rangle} = 1\).
Solution
Let \(x\in V\). Observe that \(((id-\pi)\circ(id-\pi))(x) = (id-\pi)(x-\pi(x)) = (x-\pi(x))-(\pi(x)-\pi^2(x)) = x-2\pi(x)+\pi^2(x)\). Hence \((id -\pi)\) is a projection if and only if \(x-\pi(x) = x-2\pi(x)+\pi^2(x)\). This happens if and only if \(\pi(x) = \pi^2(x)\), that is to say, where \(\pi\) is a projection.
Solution
We have \(Im(id-\pi) = \{(id-\pi)(x):x\in V\} = \{x-\pi(x):x\in V\}\). Observe that \(\pi(x-\pi(x)) = \pi(x)-\pi^2(x) = 0\), since \(\pi\) is a projection. Thus, \(Im(id-\pi)\subseteq\ker\pi\). Now, suppose \(k\in \ker \pi\). Then \(k-\pi(k)=k\), so \(k\in Im(id-\pi)\). Thus \(\ker\pi\subseteq Im(id-\pi)\). Therefore, we have that \(Im(id-\pi) = \ker\pi\). \(\lozenge\)
We have that \(\ker(id-\pi) = \{x\in V : (id-\pi)(x) = 0\} = \{x\in V : x=\pi(x)\}\). Clearly, \(\ker(id-\pi)\subseteq Im\pi\).
Take \(x\in Im\pi\). Then there exists some \(y\in V\) such that \(\pi(y)=x\). Observe that \((id-\pi)(x) = x-\pi(x) = \pi(y)-\pi^2(y) = 0\), since \(\pi\) is a projection. Hence \(Im\pi\subseteq\ker(id-\pi)\). Therefore we have that \(\ker(id-\pi) = Im\pi\).
I will assume we use the standard dot product as our inner product. Firstly, let’s get an orthogonal basis, then we can simply divide by the magnitude of each vector to get our orthonormal basis. Since \(b_1\) is our first vector, there is nothing to do for now. To get the second vector in our basis, we need a vector, \(b_2’\), perpendicular to \(b_1\), such that \(span(b_1,b_2) = span (b_1, b_2’)\). This is given by \(b_2’ = b_2 - \pi_{span(b_1)}(b_2)\).
Now, \(\pi_{span(b_1)}(b_2) = \frac{b_1b_1^\mathsf{T}}{\lVert b_1 \rVert^2} b_2 = \frac{1}{3} \begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}\begin{bmatrix}-1\\2\\0\end{bmatrix} = \frac{1}{3}\begin{bmatrix}1\\1\\1\end{bmatrix}\).
Therefore, \(b_2’ = \begin{bmatrix}-\frac43 \\ \frac53 \\ -\frac13 \end{bmatrix}\).
Hence, our orthonormal basis, \(C\), is given by \(C=\left\{\frac{1}{\sqrt3}\begin{bmatrix} 1\\1\\1 \end{bmatrix} , \frac{1}{\sqrt{42}}\begin{bmatrix}-4\\5\\ -1 \end{bmatrix}\right\}\).
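A short NumPy sketch of the same Gram–Schmidt step, assuming \(b_1=(1,1,1)^{\mathsf{T}}\) and \(b_2=(-1,2,0)^{\mathsf{T}}\) as used in the projection above:

```python
import numpy as np

b1 = np.array([1., 1., 1.])
b2 = np.array([-1., 2., 0.])

# Gram-Schmidt: subtract the projection of b2 onto span(b1), then normalise.
b2_orth = b2 - (b1 @ b2) / (b1 @ b1) * b1
c1 = b1 / np.linalg.norm(b1)
c2 = b2_orth / np.linalg.norm(b2_orth)

print(c1)       # (1, 1, 1) / sqrt(3)
print(c2)       # (-4, 5, -1) / sqrt(42)
print(c1 @ c2)  # 0 (up to rounding), so the basis is orthonormal
```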
Solution
Take \(x = (x_1,\dots, x_n)\) and \(y=(1,\dots,1)\). The Cauchy-Schwarz tells us that \(\langle x, y \rangle \leq \lVert x \rVert \lVert y \rVert\), so we have \(\sum_{i=1}^{n}x_i \leq \sqrt{\sum_{i=1}^{n}{x_i^2}}\sqrt{n}\). Using \(\sum x_i = 1\) and squaring both sides (noting that both sides are indeed positive!) gives us the required result.
Solution
Take \(x = \left(\sqrt{x_1},\dots, \sqrt{x_n}\right)\) and \(y = \left(\frac{1}{\sqrt{x_1}},\dots,\frac{1}{\sqrt{x_n}}\right)\). Then Cauchy-Schwarz gives us that \(n \leq \sqrt{\sum x_i} \sqrt{\sum \frac{1}{x_i}}\). The same trick as before gives us the required result.
I will assume we need to rotate these vectors anticlockwise. A rotation about an angle \(\theta\) is given by the matrix \(\begin{bmatrix} \cos\theta&-\sin\theta\\ \sin\theta&\cos\theta \end{bmatrix}\). Thus for \(\theta=30^\circ\), we have \(R=\begin{bmatrix} \frac{\sqrt{3}}{2}&-\frac{1}{2}\\ \frac12 & \frac{\sqrt3}{2} \end{bmatrix} = \frac12 \begin{bmatrix} \sqrt3 & -1\\ 1&\sqrt3 \end{bmatrix}\).
Therefore, \(Rx_1 = \frac12 \begin{bmatrix} 2\sqrt3-3\\2+3\sqrt3 \end{bmatrix}\) and \(Rx_2 = \frac12 \begin{bmatrix} 1\\ -\sqrt3 \end{bmatrix}\).
Matrix Decompositions
Solution
We have \(\det A = 1(4\cdot4-6\cdot2) - 3(2\cdot 4 - 6\cdot 0) + 5(2\cdot 2-4\cdot 0) = 0\).
Solution
We have \(\det A = 1\cdot 4 \cdot 4 + 2\cdot 2\cdot 5 + 0\cdot 3\cdot 6 - 5\cdot 4\cdot 0 - 6\cdot 2\cdot 1 - 4\cdot 3\cdot 2 = 0\).
Perform Gaussian elimination to “fix” the first columns. I have used only the rule allowing me to add a multiple of a row to a different row, which doesn’t change the determinant. We have \(\begin{bmatrix} 2&0&1&2&0\\ 0&-1&-1&-1&1\\ 0&0&1&0&3\\ 0&0&3&1&2\\ 0&0&-1&-1&1 \end{bmatrix}\).
From here, we can either continue with Gaussian elimination to get our matrix into upper triangular form, then multiply the entries on the diagonal together (remembering to take into account any elementary operations which would change the determinant!), or we can simply compute the determinant of the lower-right \(3\times 3\) matrix, since this is quick to do by hand (it is \(-3\)). Thus the determinant of the overall matrix is \(2\cdot -1\cdot -3 = 6\).
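Since only row additions were used, the displayed matrix has the same determinant as the original, which we can check numerically:

```python
import numpy as np

# The partially reduced matrix from the step above; adding multiples of one
# row to another does not change the determinant.
M = np.array([[2,  0,  1,  2, 0],
              [0, -1, -1, -1, 1],
              [0,  0,  1,  0, 3],
              [0,  0,  3,  1, 2],
              [0,  0, -1, -1, 1]])

print(round(np.linalg.det(M)))   # 6
```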
Solution
If we solve the equation \(\det(A-\lambda I) = 0\) for \(\lambda\), we obtain \(\lambda = 1\) only. (Or, indeed, we can observe that \(A\) is in lower triangular form, so the eigenvalues are the entries on the main diagonal.) The space \(E_1 = \{x\in \mathbb{R}^2 : (A-I)(x)=0\} = span\{(0,1)\}\).
Solution
Again, solving \(\det(B-\lambda I ) = 0\), we find that \(\lambda=2\) or \(\lambda = -3\). We then have \(E_2 = span\{(1,2)\}\) and \(E_{-3} = span\{ (-2,1) \}\).
If we take \(\det(A-\lambda I)\) and simplify, we have \((\lambda-2)(\lambda-1)(\lambda+1)^2\), so we have three distinct eigenvalues. For \(\lambda=2,1\), our eigenspace will certainly have dimension 1. For \(\lambda = -1\), it could (at this stage!) have dimension 1 or 2. Observe that \((1,0,1,1)\) is an eigenvector with eigenvalue 2, and \((1,1,1,1)\) is an eigenvector with eigenvalue 1. Thus \(E_2=span\{(1,0,1,1)\}\) and \(E_1 = span\{(1,1,1,1)\}\).
Now observe that \(rank(A+I)=3\), so there can only be one linearly independent eigenvector with eigenvalue -1. Note that \((0,1,1,0)\) will do. Hence \(E_{-1} = span\{(0,1,1,0)\}\).
Only the first two are diagonalisable – they are diagonal, so there is nothing to prove! The other two, however, are not diagonalisable – each only has one linearly independent eigenvector, whereas we need there to exist a basis of eigenvectors to yield diagonalisability. Only the first and third matrices are invertible – the determinants are non-zero, while the other two matrices have determinant zero.
This tells us that diagonalisability is indeed unrelated to invertibility!
Solution
The characteristic polynomial is \((5-\lambda)(1-\lambda)^2\). Now, \(rank(A-I) = 2\), so there is only one linearly independent eigenvector for \(\lambda=1\). Hence \(A\) is not diagonalisable. We have \(E_5 = span\{(1,1,0)\}\), and \(E_1=span\{(-3,1,0)\}\).
Solution
The characteristic polynomial here is \(-\lambda^3(1-\lambda)\), so the eigenvalues are 1, and 0 (with multiplicity 3). Observe that \(rank(A-0I) = rank A = 1\), so there are three linearly independent eigenvectors for the eigenvalue 0. With the other eigenvector for \(\lambda=1\), we will have a basis of eigenvectors, and hence \(A\) will be diagonalisable. We have \(E_1 = span\{(1,0,0,0)\}\), and \(E_0 = span\{(1,-1,0,0),(0,0,1,0),(0,0,0,1)\}\).
Solution
If we form the characteristic polynomial, we have \(\lambda^2-4\lambda +8\), which has no roots in \(\mathbb{R}\). However, if we extend to \(\mathbb{C}\), then we will be able to diagonalise the matrix.
Solution
This is a symmetric matrix, and is therefore diagonalisable. Its eigenvalues are \(3\), and \(0\) with multiplicity two, so its diagonal form is \(\begin{bmatrix} 3&0&0\\ 0&0&0\\ 0&0&0 \end{bmatrix}\), and a basis of eigenvectors is \(\{(1,1,1),(1,-1,0),(1,0,-1)\}\).
Solution
Here, we have three distinct eigenvalues, and \(\lambda = 4\) has multiplicity two. However, \(rank(A-4I) = 3\), so there is only one linearly independent eigenvector for this eigenvalue, and thus \(A\) cannot have a basis of eigenvectors, so it is not diagonalisable.
Solution
Again here we have two distinct eigenvalues – \(\lambda=1\) with multiplicity one and \(\lambda=2\) with multiplicity two. However, this time, observe that \(rank(A-2I)=1\), so there are indeed two linearly independent eigenvectors for this eigenvalue. Thus \(A\) is diagonalisable, with diagonal form \(\begin{bmatrix} 1&0&0\\ 0&2&0\\ 0&0&2 \end{bmatrix}\), with eigenvectors \(\{(3,-1,3),(2,1,0),(2,0,1)\}\).
First, we compute \(A^{\mathsf{T}}A = \begin{bmatrix} 13&12&2\\ 12&13&-2\\ 2&-2&8 \end{bmatrix}\). We can diagonalise this to find \(D = \begin{bmatrix} 25&0&0\\ 0&9&0\\ 0&0&0 \end{bmatrix}\) and \(P = \begin{bmatrix} \frac{1}{\sqrt2}&\frac{1}{\sqrt{18}}&-\frac23\\ \frac{1}{\sqrt2}&-\frac{1}{\sqrt{18}}&\frac23\\ 0&\frac{4}{\sqrt{18}}&\frac13 \end{bmatrix}\).
We take the square roots of the entries of \(D\) to find \(\Sigma = \begin{bmatrix} 5&0&0\\ 0&3&0 \end{bmatrix}\), with \(V\) equalling our \(P\) above.
From here, we compute \(u_1 = \frac{1}{\sigma_1}Av_1 = \frac15\begin{bmatrix}3&2&2\\2&3&-2\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt2}\\ \frac{1}{\sqrt2}\\ 0\end{bmatrix} =\begin{bmatrix}\frac{1}{\sqrt2}\\ \frac{1}{\sqrt2}\end{bmatrix}\), and \(u_2 = \frac{1}{\sigma_2} A v_2 = \frac13 \begin{bmatrix}3&2&2\\2&3&-2\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt{18}}\\ -\frac{1}{\sqrt{18}}\\ \frac{4}{\sqrt{18}}\end{bmatrix} =\begin{bmatrix}\frac{1}{\sqrt2}\\ -\frac{1}{\sqrt2}\end{bmatrix}\), giving \(U=\frac{1}{\sqrt2}\begin{bmatrix}1&1\\1&-1\end{bmatrix}\).
It can be checked that \(A = U\Sigma V^{\mathsf{T}}\), indeed!
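A quick check with NumPy's SVD routine (the signs of the singular vectors may differ from ours, but the singular values and the reconstruction agree):

```python
import numpy as np

A = np.array([[3., 2.,  2.],
              [2., 3., -2.]])

U, S, Vt = np.linalg.svd(A)
print(S)                                        # [5. 3.]
print(np.allclose(U @ np.diag(S) @ Vt[:2], A))  # True: A = U Sigma V^T
```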
Observe that the eigenvalues of \(A\) are complex, so we cannot simply find the eigendecomposition. Proceeding as in the previous question, we have \(A^{\mathsf{T}}A = \begin{bmatrix}5&3\\3&5\end{bmatrix}\), which when we perform the eigendecomposition on this new matrix, we obtain \(D = \begin{bmatrix}8&0\\0&2\end{bmatrix}\) and \(P=\frac{1}{\sqrt2}\begin{bmatrix}1&-1\\1&1\end{bmatrix}\). This \(P\) is again our required \(V\), and we have \(\Sigma = \begin{bmatrix}2\sqrt2 & 0\\0&\sqrt2\end{bmatrix}\).
As before, we can now compute
\(u_1 = \frac{1}{2\sqrt2}\begin{bmatrix}2&2\\ -1&1\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt2}\\ \frac{1}{\sqrt2}\end{bmatrix} = \begin{bmatrix}1\\0\end{bmatrix}\), and similarly \(u_2 = \begin{bmatrix}0\\1\end{bmatrix}\). Hence, our \(U\) turns out to be the identity matrix.
Using our singular value decomposition from Question 8, we construct \(A_1 = \sigma_1 u_1 v_1^{\mathsf{T}} = 5 \begin{bmatrix}\frac{1}{\sqrt2}\\ \frac{1}{\sqrt2}\end{bmatrix}\begin{bmatrix}\frac{1}{\sqrt2}&\frac{1}{\sqrt2}&0\end{bmatrix} =\frac52 \begin{bmatrix}1&1&0\\1&1&0\end{bmatrix}\), and similarly \(A_2 = \frac12 \begin{bmatrix}1&-1&4\\ -1&1&-4\end{bmatrix}\).
Then \(A_1\) and \(A_2\) both have rank one, with \(A=A_1+A_2\), as required.
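A minimal NumPy check of the rank-1 decomposition:

```python
import numpy as np

A = np.array([[3., 2.,  2.],
              [2., 3., -2.]])
A1 = 2.5 * np.array([[1., 1., 0.],
                     [1., 1., 0.]])
A2 = 0.5 * np.array([[ 1., -1.,  4.],
                     [-1.,  1., -4.]])

print(np.linalg.matrix_rank(A1), np.linalg.matrix_rank(A2))  # 1 1
print(np.allclose(A1 + A2, A))                               # True
```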
We need only show this in one direction, since we can replace \(A\) with \(A^{\mathsf{T}}\) below, and the argument still holds true. Suppose \(\lambda\neq 0\) is an eigenvalue of \(A^{\mathsf{T}}A\). Then there is some vector \(x\neq 0\) such that \(A^{\mathsf{T}}Ax = \lambda x\). Thus, \(AA^{\mathsf{T}}Ax = \lambda Ax\). Note that \(Ax\neq 0\), since \(A^{\mathsf{T}}Ax = \lambda x \neq 0\). Therefore, \(Ax\) is an eigenvector of \(AA^{\mathsf{T}}\), with eigenvalue \(\lambda\). Please note that \(x\) is in general not an eigenvector of \(AA^{\mathsf{T}}\); in fact it cannot even be multiplied by it, because it is \(n\)-dimensional, and not \(m\)-dimensional.
The left-hand side describes the largest \(\Vert Ax \Vert_2\) can be, relative to \(\Vert x \Vert_2\). That is to say, it represents the biggest scaling that can take place under \(A\). If we write \(A=U\Sigma V^{\mathsf{T}}\), and then note that \(U\) and \(V^{\mathsf{T}}\) are both orthogonal change of basis matrices (which preserve the 2-norm), we deduce that only \(\Sigma\) is doing any scaling. But \(\Sigma\) is an “almost diagonal” matrix, and hence it scales based on its non-zero entries only. Remember that (by convention!) all the entries on the main diagonal of \(\Sigma\) are non-negative. Hence the largest such value represents the biggest scaling factor, which is the right-hand side of the identity.
Vector Calculus
Firstly, let’s rewrite \(f(x)\) as \(f(x) = 4\log (x)\sin(x^3)\). Then, \(f’(x) = \frac4x\sin(x^3)+12x^2\log(x)\cos(x^3)\).
If we rewrite our function as \(f(x) = (1+\exp(-x))^{-1}\), then we have \(f’(x) = \frac{\exp(-x)}{(1+\exp(-x))^2}\).
We have \(f’(x) = \frac{\mu-x}{\sigma^2} f(x)\).
We compute the first five derivatives of our function at 0. We have \(f(0)=f’(0)=1\), \(f^{(2)}(0)=f^{(3)}(0)=-1\), and \(f^{(4)}(0)=f^{(5)}(0)=1\). The Taylor polynomial \(T_5(x) = 1+x-\frac12 x^2 -\frac16 x^3 + \frac1{24} x^4 + \frac1{120}x^5\). The lower-order Taylor polynomials can be found by truncating this expression appropriately.
Solution
We can see below that \(\frac{\partial f_1}{\partial x}\) has dimension \(1\times2\); \(\frac{\partial f_2}{\partial x}\) has dimension \(1\times n\); and \(\frac{\partial f_3}{\partial x}\) has dimension \(n^2\times n\).
Solution
We have \(\frac{\partial f_1}{\partial x} = \begin{bmatrix} \cos(x_1)\cos(x_2)&-\sin(x_1)\sin(x_2) \end{bmatrix}\); \(\frac{\partial f_2}{\partial x} = y^{\mathsf{T}}\). (Remember, \(y\) is a column vector!)
Solution
Note that \(xx^{\mathsf{T}}\) is the matrix \(\begin{bmatrix} x_1^2 & x_1x_2 & \cdots & x_1x_n \\ x_2x_1 & x_2^2 & \cdots & x_2x_n \\ \vdots & \vdots & \ddots & \vdots \\ x_nx_1 & x_nx_2 & \cdots & x_n^2 \end{bmatrix}\). Thus its derivative will be a higher-order tensor. However, if we consider the matrix to be an $n^2$-dimensional object in its own right, we can compute the Jacobian. Its first row consists of \[\begin{bmatrix} 2x_1&x_2&\cdots&x_n | x_2&0&\cdots&0 | \cdots |x_n&0&\cdots&0 \end{bmatrix}\] where I have inserted a vertical bar every \(n\) columns, to aid readability.
We have \(\frac{df}{dt} = \cos(\log(t^{\mathsf{T}}t))\cdot\frac1{t^{\mathsf{T}}t}\cdot \begin{bmatrix} 2t_1&2t_2&\cdots&2t_D \end{bmatrix}= \cos(\log(t^{\mathsf{T}}t))\cdot\frac{2t^{\mathsf{T}}}{t^{\mathsf{T}}t}\).
For \(g\), if we explicitly compute \(AXB\) and find its trace, we have that \(g(X) = \sum_{k=1}^D \sum_{j=1}^F \sum_{i=1}^E a_{ki}x_{ij}b_{jk}\). Thus we have, \(\frac{\partial g}{\partial x_{ij}} = \sum_{k=1}^D b_{jk}a_{ki}\), and this is the $(i,j)$-th entry of the required derivative.
If we look at the last equation, we can see that the $(i, j)$-th partial derivative is the product of $j$-th row of \(B\) and $i$-th column of \(A\). Therefore, the right side of the equation is \(\left(BA\right)_{ji}\). Since the order of the indices is reversed, i.e., the $(i, j)$-th partial derivative is the $(j, i)$-th element of \(BA\), \(\frac{dg}{dX} = (BA)^{\mathsf{T}} = A^{\mathsf{T}}B^{\mathsf{T}}\).
Checking the dimensions, we can see that \(\left(BA\right)^{\mathsf{T}} \in R^{E \times F}\) which corresponds with the dimensions of \(X\).
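A hedged numerical sketch of this identity: for randomly generated \(A\), \(X\), \(B\) (illustrative sizes only), a finite-difference gradient of \(g(X)=\mathrm{tr}(AXB)\) should match \(A^{\mathsf{T}}B^{\mathsf{T}}\).

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, F = 3, 4, 5
A = rng.standard_normal((D, E))
X = rng.standard_normal((E, F))
B = rng.standard_normal((F, D))

g = lambda X: np.trace(A @ X @ B)

# Central finite-difference gradient of g with respect to each entry of X.
eps = 1e-6
grad = np.zeros_like(X)
for i in range(E):
    for j in range(F):
        dX = np.zeros_like(X)
        dX[i, j] = eps
        grad[i, j] = (g(X + dX) - g(X - dX)) / (2 * eps)

print(np.allclose(grad, A.T @ B.T, atol=1e-5))   # True
```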
Solution
The chain rule tells us that \(\frac{df}{dx} = \frac{df}{dz}\frac{dz}{dx}\), where \(\frac{df}{dz}\) has dimension \(1\times 1\), and \(\frac{dz}{dx}\) has dimension \(1\times D\). We know \(\frac{dz}{dx}=2x^{\mathsf{T}}\) from \(f\) in Question 6. Also, \(\frac{df}{dz} = \frac{1}{1+z}\). Therefore, \(\frac{df}{dx} = \frac{2x^{\mathsf{T}}}{1+x^{\mathsf{T}}x}\).
Solution
Here we have \(\frac{df}{dz}\) is an \(E\times E\) matrix, namely \(\begin{bmatrix} \cos z_1 & 0 & \cdots & 0\\ 0 & \cos z_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0&\cdots&\cos z_E \end{bmatrix}\).
Also, \(\frac{dz}{dx}\) is an \(E\times D\)-dimensional matrix, namely \(A\) itself.
The overall derivative is obtained by multiplying these two matrices together, which will again give us an \(E\times D\)-dimensional matrix.
Solution
We have \(\frac{df}{dz}\) has dimension \(1\times 1\), and is simply \(-\frac12 \exp(-\frac12 z)\). Now, \(\frac{dz}{dy}\) has dimension \(1\times D\), and is given by \(y^{\mathsf{T}}(S^{-1}+(S^{-1})^{\mathsf{T}})\).
Finally, \(\frac{dy}{dx}\) has dimension \(D\times D\), and is just the identity matrix.
Again, we multiply these all together to get our final derivative.
Solution
If we explicitly write out \(xx^{\mathsf{T}}+\sigma^2I\), and compute its trace, we find that \(f(x) = x_1^2 + \dots + x_n^2 + n\sigma^2\). Hence, \(\frac{df}{dx} = 2x^{\mathsf{T}}\).
Solution
Here, \(\frac{df}{dz} = \begin{bmatrix} \frac{1}{\cosh^2z_1}&0&\cdots&0\\ 0&\frac{1}{\cosh^2z_2}&\cdots&0\\ \vdots & \vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{1}{\cosh^2z_M} \end{bmatrix}\), while \(\frac{dz}{dx} = A\), as in Question 7b.
Finally, \(\frac{df}{dx}\) is given by the product of these two matrices.
Piecing this together, replacing \(z\) with \(t(\epsilon,\nu)\) throughout, we have that \(g(\nu) = \log(p(x,t(\epsilon,\nu))) - \log(q(t(\epsilon,\nu),\nu))\). Therefore, $\frac{dg}{d\nu} = \frac{p’(x,t(\epsilon,\nu))\cdot t’(\epsilon,\nu) }{p(x,t(\epsilon,\nu))} - \frac{q’(t(\epsilon,\nu),\nu)\cdot t’(\epsilon,\nu)}{q(t(\epsilon,\nu),\nu)} $.
Probability and Distributions
The marginal distributions are obtained by summing the probabilities over all the values of the variable being marginalized. Thus, to obtain \(p(x)\) we sum down each column (i.e., over the values corresponding to different \(y\)):
\begin{aligned} p(x_1) &= P(X = x_1) = P(X = x_1, Y = y_1) + P(X = x_1, Y = y_2) + P(X = x_1, Y = y_3) = 0.01 + 0.05 + 0.1 = 0.16 \\ p(x_2) &= P(X = x_2) = P(X = x_2, Y = y_1) + P(X = x_2, Y = y_2) + P(X = x_2, Y = y_3) = 0.02 + 0.1 + 0.05 = 0.17\\ p(x_3) &= P(X = x_3) = P(X = x_3, Y = y_1) + P(X = x_3, Y = y_2) + P(X = x_3, Y = y_3) = 0.03 + 0.05 + 0.03 = 0.11\\ p(x_4) &= P(X = x_4) = P(X = x_4, Y = y_1) + P(X = x_4, Y = y_2) + P(X = x_4, Y = y_3) = 0.1 + 0.07 + 0.05 = 0.22\\ p(x_5) &= P(X = x_5) = P(X = x_5, Y = y_1) + P(X = x_5, Y = y_2) + P(X = x_5, Y = y_3) = 0.1 + 0.2 + 0.04 = 0.34 \end{aligned}
As a correctness check, note that this distribution satisfies the normalization condition, i.e. that sum of the probabilities is \(1\):
\begin{equation} \sum_{i=1}^5 p(x_i) = 1 \end{equation}
The marginal distribution \(p(y)\) can be obtained in a similar way, by summing the matrix rows:
\begin{aligned} p(y_1) &= P(Y = y_1) = \sum_{i=1}^5 P(X = x_i, Y = y_1) = 0.01 + 0.02 + 0.03 + 0.1 + 0.1 = 0.26 \\ p(y_2) &= P(Y = y_2) = \sum_{i=1}^5 P(X = x_i, Y = y_2) = 0.05 + 0.1 + 0.05 + 0.07 + 0.2 = 0.47 \\ p(y_3) &= P(Y = y_3) = \sum_{i=1}^5 P(X = x_i, Y = y_3) = 0.1 + 0.05 + 0.03 + 0.05 + 0.04 = 0.27 \end{aligned}
We can again check that the normalization condition is satisfied:
\begin{equation} \sum_{i=1}^3p(y_i) = 1 \end{equation}
To determine conditional distributions we use the definition of the conditional probability:
\[P(X = x , Y = y_1) = P(X = x | Y = y_1)P(Y = y_1) = p(x | Y = y_1) p(y_1).\]
Thus,
\begin{align*} p(x_1 | Y = y_1) &= \frac{P(X = x_1, Y = y_1)}{p(y_1)} = \frac{0.01}{0.26} \approx 0.038\\ p(x_2 | Y = y_1) &= \frac{P(X = x_2, Y = y_1)}{p(y_1)} = \frac{0.02}{0.26} \approx 0.077\\ p(x_3 | Y = y_1) &= \frac{P(X = x_3, Y = y_1)}{p(y_1)} = \frac{0.03}{0.26} \approx 0.115\\ p(x_4 | Y = y_1) &= \frac{P(X = x_4, Y = y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385\\ p(x_5 | Y = y_1) &= \frac{P(X = x_5, Y = y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385 \end{align*}
Likewise the conditional distribution \(p(y | X = x_3)\) is given by
\begin{align*} p(y_1 | X = x_3) &= \frac{P(X = x_3, Y = y_1)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273\\ p(y_2 | X = x_3) &= \frac{P(X = x_3, Y = y_2)}{p(x_3)} = \frac{0.05}{0.11} \approx 0.455\\ p(y_3 | X = x_3) &= \frac{P(X = x_3, Y = y_3)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273 \end{align*}
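The whole table can be processed in a few lines of NumPy; the joint probabilities below are the ones appearing in the sums above:

```python
import numpy as np

# Joint probabilities P(X = x_j, Y = y_i): rows are y1..y3, columns x1..x5.
P = np.array([[0.01, 0.02, 0.03, 0.10, 0.10],
              [0.05, 0.10, 0.05, 0.07, 0.20],
              [0.10, 0.05, 0.03, 0.05, 0.04]])

p_x = P.sum(axis=0)       # marginal p(x): [0.16 0.17 0.11 0.22 0.34]
p_y = P.sum(axis=1)       # marginal p(y): [0.26 0.47 0.27]
print(p_x, p_y, P.sum())  # the table sums to 1

print(P[0] / p_y[0])      # conditional p(x | Y = y1)
print(P[:, 2] / p_x[2])   # conditional p(y | X = x3)
```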
We can write the probability density of the two-dimensional distribution as \[p(x,y)= 0.4\mathcal{N}\left(x, y|\begin{bmatrix} 10\\ 2\end{bmatrix}, \begin{bmatrix} 1&0\\0&1\end{bmatrix}\right)+ 0.6\mathcal{N}\left(x, y|\begin{bmatrix} 0\\ 0\end{bmatrix}, \begin{bmatrix} 8.4&2.0\\2.0&1.7\end{bmatrix}\right)\]
The marginal distribution of a weighted sum of distributions is given by the weighted sum of marginals, and the marginals of a bivariate normal distribution \(\mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})\) are obtained according to the rule
\[\int \mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})dy= \mathcal{N}(x|\mu_x, \Sigma_{xx}), \\ \int \mathcal{N}(x,y|\mathbf{\mu},\mathbf{\Sigma})dx = \mathcal{N}(y|\mu_y, \Sigma_{yy}) \]
Thus, the marginals of the distribution of interest are
\begin{gather*} p(x) = 0.4\mathcal{N}(x| 10, 1) + 0.6\mathcal{N}(x| 0, 8.4),\\ p(y) = 0.4\mathcal{N}(y| 2, 1) + 0.6\mathcal{N}(y| 0, 1.7) \end{gather*}
The mean of a weighted sum of two distributions is the weighted sum of their averages
\begin{gather*} \mathbb{E}_X[x] = 0.4\cdot 10 + 0.6\cdot 0 = 4,\\ \mathbb{E}_Y[y] = 0.4\cdot 2 + 0.6\cdot 0 = 0.8 \end{gather*}
The mode of a continuous distribution is a point where this distribution has a peak. It can be determined by solving the extremum condition for each of the marginal distributions:
\begin{gather*} \frac{dp(x)}{dx} = 0,\\ \frac{dp(y)}{dy} = 0 \end{gather*}
In the case of a mixture of normal distributions these equations are non-linear and can be solved only numerically. After finding all the solutions of these equations one has to verify for every solution that it is a peak rather than an inflection point, i.e. that at this point
\[\frac{d^2p(x)}{dx^2} < 0 \text{ or } \frac{d^2p(y)}{dy^2} < 0\]
The medians \(m_x, m_y\) can be determined from the conditions
\begin{gather*} \int_{-\infty}^{m}p(x)dx = \int^{+\infty}_{m}p(x)dx,\\ \int_{-\infty}^{m}p(y)dy = \int^{+\infty}_{m}p(y)dy \end{gather*}
Again, these equations can be solved here only numerically.
The mean of a two-dimensional distribution is a vector of means of the marginal distributions
\[\mathbf{\mu} = \begin{bmatrix}4\\0.8\end{bmatrix}\]
The mode of two dimensional distribution is obtained first by solving the extremum conditions
\[\frac{\partial p(x,y)}{\partial x} = 0, \frac{\partial p(x,y)}{\partial y} = 0\]
and then verifying for every solution that it is indeed a peak, i.e.
\begin{gather*} \frac{\partial^2 p(x,y)}{\partial x^2} < 0, \frac{\partial^2 p(x,y)}{\partial y^2} < 0,\\ \det\left( \begin{bmatrix} \frac{\partial^2 p(x,y)}{\partial x^2} & \frac{\partial^2 p(x,y)}{\partial x\partial y}\\ \frac{\partial^2 p(x,y)}{\partial x\partial y} & \frac{\partial^2 p(x,y)}{\partial y^2} \end{bmatrix} \right) > 0 \end{gather*}
Again, these equations can be solved only numerically.
The conjugate prior to the Bernoulli distribution is the Beta distribution \[p(\mu | \alpha, \beta) =\frac{1}{\mathcal{B}(\alpha, \beta)} \mu^{\alpha -1}(1-\mu)^{\beta-1} \propto \mu^{\alpha -1}(1-\mu)^{\beta-1},\]
where \(\alpha,\beta\) are not necessarily integers and the normalisation coefficient is the Beta function defined as
\[\mathcal{B}(\alpha, \beta) = \int_0^1 t^{\alpha -1}(1-t)^{\beta-1}dt\]
The likelihood of observing data \(\{x_1, x_2, …, x_N\}\) is
\[p(x_1, …, x_N|\mu) = \prod_{i=1}^Np(x_i|\mu) = \prod_{i=1}^N \mu^{x_i}(1-\mu)^{1-x_i} = \mu^{\sum_{i=1}^N x_i}(1-\mu)^{N-\sum_{i=1}^N x_i}\]
The posterior distribution is proportional to the product of this likelihood with the prior distribution (Bayes theorem):
\[p(\mu |x_1, …, x_N) \propto p(x_1, …, x_N|\mu)p(\mu | \alpha, \beta) \propto \mu^{\sum_{i=1}^N x_i + \alpha -1}(1-\mu)^{N-\sum_{i=1}^N x_i +\beta -1}\]
This is also a Beta distribution, i.e. our choice of the conjugate prior was correct. The normalization constant is readily determined:
\[p(\mu |x_1, …, x_N) = \frac{1}{\mathcal{B}(\sum_{i=1}^N x_i + \alpha, N-\sum_{i=1}^N x_i +\beta)} \mu^{\sum_{i=1}^N x_i + \alpha -1}(1-\mu)^{N-\sum_{i=1}^N x_i +\beta -1}\]
The probabilities of picking a mango or an apple from the first bag are given by
\begin{gather*} p(\text{mango} |1) = \frac{4}{6} = \frac{2}{3}\\ p(\text{apple} |1) = \frac{2}{6} = \frac{1}{3} \end{gather*}
The probabilities of picking a mango or an apple from the second bag are
\begin{gather*} p(\text{mango} |2) = \frac{4}{8} = \frac{1}{2}\\ p(\text{apple} |2) = \frac{4}{8} = \frac{1}{2} \end{gather*}
The probabilities of picking the first and the second bag are equal to the probabilities of heads and tails respectively:
\begin{gather*} p(1) = 0.6,\\ p(2) = 0.4 \end{gather*}
We now can obtain the probability that the mango was picked from the second bag using Bayes’ theorem:
\[ p(2 | \text{mango}) = \frac{p(\text{mango} | 2)p(2)}{p(\text{mango})} = \frac{p(\text{mango} | 2)p(2)}{p(\text{mango} | 1)p(1) + p(\text{mango} | 2)p(2)} = \frac{\frac{1}{2}0.4}{\frac{2}{3}0.6 + \frac{1}{2}0.4} = \frac{1}{3} \]
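The same computation as a tiny Python sketch:

```python
p_mango_given_1, p_mango_given_2 = 2 / 3, 1 / 2
p_1, p_2 = 0.6, 0.4

# Bayes' theorem: p(2 | mango) = p(mango | 2) p(2) / p(mango)
p_mango = p_mango_given_1 * p_1 + p_mango_given_2 * p_2
p_2_given_mango = p_mango_given_2 * p_2 / p_mango
print(p_2_given_mango)   # 0.333... = 1/3
```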
\(\mathbf{x}_{t+1}\) is obtained from \(\mathbf{x}_{t}\) by a linear transformation, \(\mathbf{A}\mathbf{x}_{t}\), and adding a Gaussian random variable \(\mathbf{w}\). The initial distribution of \(\mathbf{x}_{0}\) is Gaussian, a linear transformation of a Gaussian random variable is again Gaussian, and a sum of independent Gaussian random variables is Gaussian. Thus, the joint distribution \(p(\mathbf{x}_{0}, \mathbf{x}_{1},…,\mathbf{x}_{T})\) is also a Gaussian distribution.
Solution
Let \(\mathbf{z} = \mathbf{A}\mathbf{x}_{t}\). Since this is a linear transformation of a Gaussian random variable \(\mathbf{x}_t \sim \mathcal{N}(\mathbf{\mu}_t,\mathbf{\Sigma})\), \(\mathbf{z}\) is distributed as (see Eq. (6.88)) \[\mathbf{z} \sim \mathcal{N}(\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T),\]
whereas the mean and the covariance of a sum of two Gaussian random variables are given by the sum of the means and the covariances of these variables, i.e.,
\[\mathbf{x}_{t+1} = \mathbf{z} + \mathbf{w} \sim \mathcal{N}(\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}),\]
That is
\[p(\mathbf{x}_{t+1}|\mathbf{y}_1,…,\mathbf{y}_t)= \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}). \]
Solution
If we assume that \(\mathbf{x}_{t+1}\) is fixed, then \(\mathbf{y}_{t+1} = \mathbf{C}\mathbf{x}_{t+1} + \mathbf{v}\) follows the same distribution as \(\mathbf{v}\), but with the mean shifted by \(\mathbf{C}\mathbf{x}_{t+1}\), i.e. \[ p(\mathbf{y}_{t+1}|\mathbf{x}_{t+1}, \mathbf{y}_1,…,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}). \]
The joint probability is then obtained as
\[ p(\mathbf{y}_{t+1}, \mathbf{x}_{t+1}| \mathbf{y}_1,…,\mathbf{y}_t)= p(\mathbf{y}_{t+1}|\mathbf{x}_{t+1}, \mathbf{y}_1,…,\mathbf{y}_t) p(\mathbf{x}_{t+1}| \mathbf{y}_1,…,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}) \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q}). \]
Solution
Let us introduce temporary notation \[ \mathbf{\mu}_{t+1} = \mathbf{A}\mathbf{\mu}_t,\\ \mathbf{\Sigma}_{t+1} = \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q},\\ p(\mathbf{x}_{t+1}|\mathbf{y}_1,…,\mathbf{y}_t) = \mathcal{N}(\mathbf{\mu}_{t+1}, \mathbf{\Sigma}_{t+1}) \]
Then the distribution of \(\mathbf{y}_{t+1}\) is obtained in terms of the parameters of the distribution \(p(\mathbf{x}_{t+1}|\mathbf{y}_1,…,\mathbf{y}_t)\) by following the same steps as in the first part, with the result
\[ p(\mathbf{y}_{t+1}|\mathbf{y}_1,…,\mathbf{y}_t)= \mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{\mu}_{t+1}, \mathbf{C}\mathbf{\Sigma}_{t+1}\mathbf{C}^T + \mathbf{R})= \mathcal{N}\left(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{A}\mathbf{\mu}_t, \mathbf{C}(\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T+ \mathbf{Q})\mathbf{C}^T + \mathbf{R}\right). \]
The required conditional distribution is then obtained as
\[ p(\mathbf{x}_{t+1}|\mathbf{y}_1,…,\mathbf{y}_t, \mathbf{y}_{t+1})= \frac{p(\mathbf{y}_{t+1}, \mathbf{x}_{t+1}| \mathbf{y}_1,…,\mathbf{y}_t)} {p(\mathbf{y}_{t+1}| \mathbf{y}_1,…,\mathbf{y}_t)}= \frac{\mathcal{N}(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{x}_{t+1}, \mathbf{R}) \mathcal{N}(\mathbf{x}_{t+1}|\mathbf{A}\mathbf{\mu}_t, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q})} {\mathcal{N}\left(\mathbf{y}_{t+1}|\mathbf{C}\mathbf{A}\mathbf{\mu}_t, \mathbf{C}(\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T + \mathbf{Q})\mathbf{C}^T + \mathbf{R}\right)} \]
The standard definition of variance is \[ \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2], \]
where
\[ \mu = \mathbb{E}_X[x]. \]
Using the linearity of expectation we can write:
\begin{gather*} \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] = \mathbb{E}_X[x^2 - 2x\mu +\mu^2] = \mathbb{E}_X[x^2] - \mathbb{E}_X[2x\mu] + \mathbb{E}_X[\mu^2]=\\ \mathbb{E}_X[x^2] - 2\mu\mathbb{E}_X[x] + \mu^2 = \mathbb{E}_X[x^2] - 2\mu^2 + \mu^2 = \mathbb{E}_X[x^2] - \mu^2 \end{gather*}
By substituting to this equation the definition of \(\mu\), we obtain the desired equation
\[ \mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 \]
Let us expand the square in the left-hand side of (6.45) \[ \frac{1}{N^2}\sum_{i,j=1}^N(x_i - x_j)^2 = \frac{1}{N^2}\sum_{i,j=1}^N(x_i^2 - 2x_i x_j + x_j^2) = \frac{1}{N^2}\sum_{i,j=1}^N x_i^2 - 2\frac{1}{N^2}\sum_{i,j=1}^N x_i x_j + \frac{1}{N^2}\sum_{i,j=1}^Nx_j^2 \]
We see that the first and the last term differ only by the summation index, i.e. they are identical: \[ \frac{1}{N^2}\sum_{i,j=1}^N x_i^2 + \frac{1}{N^2}\sum_{i,j=1}^Nx_j^2= 2\frac{1}{N^2}\sum_{i,j=1}^N x_i^2 = 2\frac{1}{N}\sum_{i=1}^N x_i^2, \]
since the summation over \(j\) gives a factor of \(N\).
The remaining term can be written as
\[ 2\frac{1}{N^2}\sum_{i,j=1}^N x_i x_j = 2\frac{1}{N^2}\sum_{i=1}^N x_i \sum_{j=1}^N x_j = 2\left(\frac{1}{N}\sum_{i=1}^N x_i\right)^2, \]
where we again used the fact that the sum is invariant to the index of summation.
We thus have proved the required relation that
\[ \frac{1}{N^2}\sum_{i,j=1}^N(x_i - x_j)^2 = 2\frac{1}{N}\sum_{i=1}^N x_i^2 - 2\left(\frac{1}{N}\sum_{i=1}^N x_i\right)^2 \]
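A quick numerical sanity check of this identity on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
N = len(x)

# Left-hand side: average squared pairwise difference.
lhs = ((x[:, None] - x[None, :]) ** 2).sum() / N**2
# Right-hand side: twice the raw second moment minus twice the squared mean.
rhs = 2 * (x ** 2).mean() - 2 * x.mean() ** 2
print(np.isclose(lhs, rhs))   # True
```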
The Bernoulli distribution is given by \[ p(x|\mu) = \mu^x (1-\mu)^{1-x} \]
We can use relation
\[ a^x = e^{x\log a} \]
to write the Bernoulli distribution as
\[ p(x|\mu) = e^{x\log\mu + (1-x)\log(1-\mu)}= e^{x\log\left(\frac{\mu}{1-\mu}\right) + \log(1-\mu)} = h(x)e^{\theta x - A(\theta)}, \]
where the last equation is the definition of a single-parameter distribution from the exponential family, in which
\[ h(x) = 1,\\ \theta = \log\left(\frac{\mu}{1-\mu}\right) \leftrightarrow \mu = \frac{e^\theta}{1+e^\theta},\\ A(\theta) = -\log(1-\mu) = \log(1+e^\theta) \]
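A small numerical check that this exponential-family form reproduces the Bernoulli probabilities:

```python
import numpy as np

mu = 0.3
theta = np.log(mu / (1 - mu))       # natural parameter
A = np.log(1 + np.exp(theta))       # log-partition function, equals -log(1 - mu)

for x in (0, 1):
    bernoulli = mu**x * (1 - mu)**(1 - x)
    exp_family = np.exp(theta * x - A)   # h(x) = 1
    print(x, bernoulli, exp_family)      # the two columns agree
```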
The binomial distribution can be transformed as \[ p(x|N,\mu) = {N\choose x} \mu^x (1-\mu)^{N-x} = {N \choose x} e^{x\log\mu + (N-x)\log(1-\mu)}= {N \choose x}e^{x\log\left(\frac{\mu}{1-\mu}\right) +N\log(1-\mu)} = h(x)e^{x\theta - A(\theta)} \]
where
\[ h(x) = {N \choose x},\\ \theta = \log\left(\frac{\mu}{1-\mu}\right),\\ A(\theta) = -N\log(1-\mu) = N\log(1+e^\theta) \]
i.e., the binomial distribution can be represented as an exponential family distribution (only \(\mu\) is treated here as a parameter, since the number of trials \(N\) is fixed).
Similarly, the beta distribution can be transformed as
\[ p(x |\alpha, \beta) = \frac{1}{\mathcal{B}(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1} = e^{(\alpha -1)\log x + (\beta -1)\log(1-x) - \log(\mathcal{B}(\alpha,\beta))}= h(x)e^{\theta_1\phi_1(x) + \theta_2\phi_2(x) -A(\theta_1, \theta_2)} \]
where
\begin{gather*} h(x) = 1,\\ \theta_1 = \alpha-1, \theta_2 = \beta-1,\\ \phi_1(x) = \log x, \phi_2(x) = \log(1-x),\\ A(\theta_1, \theta_2) = \log(\mathcal{B}(\alpha,\beta)) = \log(\mathcal{B}(1+\theta_1,1 + \theta_2)) \end{gather*}
i.e. this is a distribution from the exponential family.
The product of the two distributions is then given by
\begin{gather*} p(x|N,\mu) p(x|\alpha, \beta)={N \choose x}e^{x\log\left(\frac{\mu}{1-\mu}\right) + (\alpha-1)\log x + (\beta -1)\log(1-x) + N\log(1-\mu) - \log(\mathcal{B}(\alpha,\beta))}\\ = h(x) e^{\theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \theta_3\phi_3(x) - A(\theta_1, \theta_2, \theta_3)} \end{gather*}
where
\begin{gather*} h(x) = {N \choose x},\\ \theta_1 = \alpha-1, \theta_2 = \beta-1, \theta_3 = \log\left(\frac{\mu}{1-\mu}\right)\\ \phi_1(x) = \log x, \phi_2(x) = \log(1-x), \phi_3(x) = x\\ A(\theta_1, \theta_2, \theta_3) = \log(\mathcal{B}(\alpha,\beta)) -N\log(1-\mu) = \log(\mathcal{B}(1+\theta_1,1 + \theta_2)) + N\log(1+e^{\theta_3}) \end{gather*}
The two normal distributions are given by
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) = (2\pi)^{-\frac{D}{2}}|\mathbf{A}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})\right],\\ \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-\frac{D}{2}}|\mathbf{B}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})\right] \end{gather*}
their product is
\[ \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-D}|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left[(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})+(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})\right]\right\} \]
The expression in the exponent can be written as
\begin{gather*} \Phi = (\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})+(\mathbf{x} - \mathbf{b})^T\mathbf{B}^{-1}(\mathbf{x} - \mathbf{b})=\\ \mathbf{x}^T\mathbf{A}^{-1}\mathbf{x} - \mathbf{a}^T\mathbf{A}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{A}^{-1}\mathbf{a}+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a}+ \mathbf{x}^T\mathbf{B}^{-1}\mathbf{x} - \mathbf{b}^T\mathbf{B}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{B}^{-1}\mathbf{b}+ \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}=\\ \mathbf{x}^T(\mathbf{A}^{-1}+\mathbf{B}^{-1})\mathbf{x}- (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})\mathbf{x}- \mathbf{x}^T(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} \end{gather*}
We now introduce the notation
\begin{gather*} \mathbf{C}^{-1} = (\mathbf{A}^{-1}+\mathbf{B}^{-1}),\\ \mathbf{c} = \mathbf{C}(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b}),\\ \mathbf{c}^T = (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})\mathbf{C}\text{ (this follows by transposing the previous equation and using the symmetry of }\mathbf{C}\text{)} \end{gather*}
The expression in the exponent now takes the form
\begin{gather*} \Phi= \mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{C}^{-1}\mathbf{c}+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}=\\ \mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{x} - \mathbf{x}^T\mathbf{C}^{-1}\mathbf{c}+ \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} + \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}- \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c}=\\ (\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \end{gather*}
where we have completed the square.
The product of the two probability distributions can now be written as
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A}) \mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = (2\pi)^{-D}|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left[(\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})+ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\}=\\ (2\pi)^{-\frac{D}{2}}|\mathbf{C}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{c})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{c})\right] \times (2\pi)^{-\frac{D}{2}}\frac{|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}}}{|\mathbf{C}|^{-\frac{1}{2}}} \exp\left\{-\frac{1}{2}\left[ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\}=\\ c\mathcal{N}(\mathbf{x}|\mathbf{c}, \mathbf{C}), \end{gather*}
where we defined
\[ c = (2\pi)^{-\frac{D}{2}}\frac{|\mathbf{A}\mathbf{B}|^{-\frac{1}{2}}}{|\mathbf{C}|^{-\frac{1}{2}}} \exp\left\{-\frac{1}{2}\left[ \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} \right]\right\} \]
We can now use the properties that a) the determinant of a matrix product is the product of the determinants, and b) the determinant of a matrix inverse is the inverse of the determinant of that matrix, and write
\[ \frac{|\mathbf{A}||\mathbf{B}|}{|\mathbf{C}|}= |\mathbf{A}||\mathbf{C}^{-1}||\mathbf{B}|= |\mathbf{A}\mathbf{C}^{-1}\mathbf{B}|= |\mathbf{A}(\mathbf{A}^{-1} + \mathbf{B}^{-1})\mathbf{B}|= |\mathbf{A} + \mathbf{B}| \]
For the expression in the exponent we can write
\begin{gather*} \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \mathbf{c}^T\mathbf{C}^{-1}\mathbf{c}= \mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} + \mathbf{b}^T\mathbf{B}^{-1}\mathbf{b}- (\mathbf{a}^T\mathbf{A}^{-1} + \mathbf{b}^T\mathbf{B}^{-1})(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} (\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})\\ =\mathbf{a}^T\left[\mathbf{A}^{-1} - \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}\right]\mathbf{a}+ \mathbf{b}^T\left[\mathbf{B}^{-1} - \mathbf{B}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{B}^{-1}\right]\mathbf{b}- \mathbf{a}^T\mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{B}^{-1}\mathbf{b}- \mathbf{b}^T\mathbf{B}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{A}^{-1}\mathbf{a} \end{gather*}
Using the property \((\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\) we obtain
\[ \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} \mathbf{B}^{-1}= \left[\mathbf{B}(\mathbf{A}^{-1} + \mathbf{B}^{-1})\mathbf{A}\right]^{-1}= (\mathbf{A} + \mathbf{B})^{-1} \]
and, since the same identity (with the roles of \(\mathbf{A}\) and \(\mathbf{B}\) interchanged) also gives \((\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1} = \mathbf{B}(\mathbf{A} + \mathbf{B})^{-1}\mathbf{A}\),
\begin{gather*} \mathbf{A}^{-1} - \mathbf{A}^{-1}(\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}=\\ \mathbf{A}^{-1}\left[\mathbf{I} - (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}\mathbf{A}^{-1}\right]=\\ \mathbf{A}^{-1}\left[\mathbf{I} - \mathbf{B}(\mathbf{A} + \mathbf{B})^{-1}\mathbf{A}\mathbf{A}^{-1}\right]=\\ \mathbf{A}^{-1}\left[\mathbf{I} - \mathbf{B}(\mathbf{A} + \mathbf{B})^{-1}\right]=\\ \mathbf{A}^{-1}\left[(\mathbf{A} + \mathbf{B}) - \mathbf{B}\right](\mathbf{A} + \mathbf{B})^{-1}=\\ (\mathbf{A} + \mathbf{B})^{-1} \end{gather*}
We thus conclude that
\begin{gather*} c = (2\pi)^{-\frac{D}{2}}|\mathbf{A}+\mathbf{B}|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}( \mathbf{a} - \mathbf{b})^T(\mathbf{A} + \mathbf{B})^{-1}(\mathbf{a} -\mathbf{b})\right\}=\\ \mathcal{N}(\mathbf{b}|\mathbf{a}, \mathbf{A}+ \mathbf{B})=\\ \mathcal{N}(\mathbf{a}|\mathbf{b}, \mathbf{A}+ \mathbf{B}) \end{gather*}
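The identity \(\mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A})\mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B}) = c\,\mathcal{N}(\mathbf{x}|\mathbf{c}, \mathbf{C})\) with \(c = \mathcal{N}(\mathbf{a}|\mathbf{b}, \mathbf{A}+\mathbf{B})\) can also be checked numerically; the sketch below uses arbitrary test values and SciPy's multivariate normal density.

```python
# A numerical sketch of the product formula N(x|a,A) N(x|b,B) = c N(x|c_vec,C)
# with c = N(a|b, A+B), evaluated at a few arbitrary test points.
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
D = 3
a, b = rng.normal(size=D), rng.normal(size=D)
A = np.eye(D) + 0.1 * np.ones((D, D))   # arbitrary symmetric positive definite covariances
B = 2 * np.eye(D)

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c_vec = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)
c = mvn.pdf(a, mean=b, cov=A + B)

for _ in range(5):
    x = rng.normal(size=D)
    lhs = mvn.pdf(x, mean=a, cov=A) * mvn.pdf(x, mean=b, cov=B)
    rhs = c * mvn.pdf(x, mean=c_vec, cov=C)
    assert np.isclose(lhs, rhs)
```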
The multivariate normal distribution \(\mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})\) can be represented as a distribution from the exponential family:
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})= (2\pi)^{-\frac{D}{2}}|\mathbf{A}|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{a})^T\mathbf{A}^{-1}(\mathbf{x} - \mathbf{a})\right]=\\ (2\pi)^{-\frac{D}{2}} \exp\left[-\frac{1}{2}\text{tr}(\mathbf{A}^{-1}\mathbf{x}\mathbf{x}^T) + \mathbf{a}^T\mathbf{A}^{-1}\mathbf{x}- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}| \right], \end{gather*}
where we used that \(\mathbf{a}^T\mathbf{A}^{-1}\mathbf{x} = \mathbf{x}^T\mathbf{A}^{-1}\mathbf{a}\), and also rewrote the quadratic term as
\[ \mathbf{x}^T\mathbf{A}^{-1}\mathbf{x}= \sum_{i,j}x_i (\mathbf{A}^{-1})_{ij} x_j= \sum_{i,j}(\mathbf{A}^{-1})_{ij} x_j x_i= \sum_{i,j}(\mathbf{A}^{-1})_{ij} (\mathbf{x}\mathbf{x}^T)_{ji}= \text{tr}(\mathbf{A}^{-1}\mathbf{x}\mathbf{x}^T) \]
Representing \(\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B})\) in a similar way and multiplying the two distributions we readily obtain
\begin{gather*} \mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B})=\\ (2\pi)^{-D} \exp\left\{-\frac{1}{2}\text{tr}\left[(\mathbf{A}^{-1}+ \mathbf{B}^{-1})\mathbf{x}\mathbf{x}^T\right]+ (\mathbf{a}^T\mathbf{A}^{-1}+\mathbf{b}^T\mathbf{B}^{-1})\mathbf{x}- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}|- \frac{1}{2}\mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \frac{1}{2}\log|\mathbf{B}| \right\}=\\ c\mathcal{N}(\mathbf{x}|\mathbf{c},\mathbf{C}), \end{gather*}
where we defined
\begin{gather*} \mathbf{C}^{-1} = \mathbf{A}^{-1}+ \mathbf{B}^{-1},\\ \mathbf{c}^T\mathbf{C}^{-1} = \mathbf{a}^T\mathbf{A}^{-1}+\mathbf{b}^T\mathbf{B}^{-1},\\ c = (2\pi)^{-\frac{D}{2}} \exp\left\{\frac{1}{2}\mathbf{c}^T\mathbf{C}^{-1}\mathbf{c} + \frac{1}{2}\log|\mathbf{C}|- \frac{1}{2}\mathbf{a}^T\mathbf{A}^{-1}\mathbf{a} - \frac{1}{2}\log|\mathbf{A}|- \frac{1}{2}\mathbf{b}^T\mathbf{B}^{-1}\mathbf{b} - \frac{1}{2}\log|\mathbf{B}| \right\} \end{gather*}
The coefficient \(c\) can now be reduced to the required form using the matrix identities derived in part a).
The expectation value and the conditional expectation value are given by \[ \mathbb{E}_X[x] = \int x p(x) dx,\\ \mathbb{E}_Y[f(y)] = \int f(y) p(y) dy,\\ \mathbb{E}_X[x|y] = \int x p(x|y) dx \]
We then have
\begin{gather*} \mathbb{E}_Y\left[\mathbb{E}_X[x|y]\right] = \int \mathbb{E}_X[x|y] p(y) dy\\ = \int \left[\int xp(x|y)dx\right]p(y) dy\\ = \int \int xp(x|y)p(y)dx dy\\ = \int\int xp(x,y)dxdy\\ = \int x\left[\int p(x,y) dy\right] dx\\ = \int x p(x) dx\\ = \mathbb{E}_X[x], \end{gather*}
where we used the definition of the conditional probability density
\[ p(x|y)p(y) = p(x,y) \]
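The law of total expectation can also be illustrated by simulation; the sketch below uses a simple hierarchical model chosen purely for illustration (it is not part of the exercise).

```python
# A Monte Carlo illustration of E_Y[E_X[x|y]] = E_X[x] for a simple
# hierarchical model (chosen here purely for illustration):
# y ~ N(0, 1) and x | y ~ N(y, 1), so E[x|y] = y and E[x] = 0.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=200_000)
x = rng.normal(loc=y, scale=1.0)

e_cond = y                 # E_X[x|y] = y for this model
print(np.mean(e_cond))     # ~ 0, i.e. E_Y[E_X[x|y]]
print(np.mean(x))          # ~ 0, i.e. E_X[x]
```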
If \(\mathbf{x}\) is fixed, then \(\mathbf{y}\) has the same distribution as \(\mathbf{w}\), but with the mean shifted by \(\mathbf{A}\mathbf{x} + \mathbf{b}\), that is \[ p(\mathbf{y}|\mathbf{x}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{Q}) \]
Let us consider the random variable \(\mathbf{u} = \mathbf{A}\mathbf{x}\); it is distributed according to
\[ p(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\mathbf{A}\mathbf{\mu}_x, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T). \]
Then \(\mathbf{y}\) is a sum of two Gaussian random variables \(\mathbf{u}\) and \(\mathbf{w}\) with its mean additionally shifted by \(\mathbf{b}\), that is
\[ p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{\mu}_x + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q}), \]
that is
\[ \mathbf{\mu}_y = \mathbf{A}\mathbf{\mu}_x + \mathbf{b},\\ \mathbf{\Sigma}_y = \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q}. \]
As in b), assuming that \(\mathbf{y}\) is fixed, we obtain the conditional distribution
\[ p(\mathbf{z}|\mathbf{y}) = \mathcal{N}(\mathbf{z}|\mathbf{C}\mathbf{y}, \mathbf{R}) \]
Since \(\mathbf{C}\mathbf{y}\) is a Gaussian random variable with distribution \(\mathcal{N}(\mathbf{C}\mathbf{\mu}_y, \mathbf{C}\mathbf{\Sigma}_y\mathbf{C}^T)\), we obtain the distribution of \(\mathbf{z}\) as that of a sum of two Gaussian random variables:
\[ p(\mathbf{z})= \mathcal{N}(\mathbf{z} |\mathbf{C}\mathbf{\mu}_y, \mathbf{C}\mathbf{\Sigma}_y\mathbf{C}^T + \mathbf{R})= \mathcal{N}(\mathbf{z} |\mathbf{C}(\mathbf{A}\mathbf{\mu}_x + \mathbf{b}), \mathbf{C}(\mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q})\mathbf{C}^T + \mathbf{R}) \]
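The mean and covariance formulas derived above can be checked by sampling; the sketch below uses arbitrary dimensions and parameters and compares the empirical moments of \(\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \mathbf{w}\) with the closed-form expressions (a loose tolerance accounts for Monte Carlo error).

```python
# A sampling-based check (dimensions and parameters arbitrary) of the
# mean/covariance formulas for y = A x + b + w derived above.
import numpy as np

rng = np.random.default_rng(3)
Dx, Dy = 2, 3
A = rng.normal(size=(Dy, Dx))
b = rng.normal(size=Dy)
mu_x = rng.normal(size=Dx)
Sigma_x = np.array([[1.0, 0.3], [0.3, 0.5]])
Q = 0.2 * np.eye(Dy)

n = 500_000
x = rng.multivariate_normal(mu_x, Sigma_x, size=n)
w = rng.multivariate_normal(np.zeros(Dy), Q, size=n)
y = x @ A.T + b + w

print(np.allclose(y.mean(axis=0), A @ mu_x + b, atol=0.05))
print(np.allclose(np.cov(y, rowvar=False), A @ Sigma_x @ A.T + Q, atol=0.05))
```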
The posterior distribution \(p(\mathbf{x}|\mathbf{y})\) can be obtained by applying Bayes' theorem:
\[ p(\mathbf{x}|\mathbf{y})= \frac{p(\mathbf{y}|\mathbf{x})p(\mathbf{x})}{p(\mathbf{y})}= \frac{\mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{Q})\mathcal{N}(\mathbf{x}|\mathbf{\mu}_x,\mathbf{\Sigma}_x)} {\mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{\mu}_x + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}_x\mathbf{A}^T + \mathbf{Q})} \]
The cdf is related to the pdf by \[ F_x(x) = \int_{-\infty}^x dx' f_x(x'),\\ \frac{d}{dx} F_x(x) = f_x(x) \]
and takes values in the interval \([0,1]\).
The pdf of the variable \(y=F_x(x)\) can then be written as
\[ f_y(y) = f_x(x) \left|\frac{dx}{dy}\right| = \frac{f_x(x)}{\left|\frac{dy}{dx}\right|} = \frac{f_x(x)}{\left|\frac{dF_x(x)}{dx}\right|} = \frac{f_x(x)}{f_x(x)} = 1, \]
i.e. \(y\) is uniformly distributed on the interval \([0,1]\).
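This probability integral transform is easy to see empirically; the sketch below (using an exponential distribution, chosen arbitrarily) applies the cdf to samples from the same distribution and checks that the result looks uniform on \([0,1]\).

```python
# A quick illustration of the probability integral transform: applying the
# cdf of an exponential distribution to its own samples gives (approximately)
# uniform values on [0, 1].
import numpy as np
from scipy.stats import expon, kstest

samples = expon.rvs(size=100_000, random_state=42)
u = expon.cdf(samples)            # y = F_x(x)

print(u.min(), u.max())           # within [0, 1]
print(kstest(u, "uniform"))       # large p-value: consistent with U[0, 1]
```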
Continuous Optimization
The first and the second derivatives of the function \(f(x)\) are given by \[ f'(x) = 3x^2 + 12 x - 3,\\ f''(x) = 6x + 12 \]
To find the positions of the maxima and minima we solve the equation for the points where the first derivative vanishes:
\[ 3x^2 + 12 x -3 = 0 \Rightarrow x_{1,2} = -2 \pm \sqrt{5} \]
Substituting each of the roots into the second derivative, we obtain
\[ f''(x_1) = f''(-2 + \sqrt{5}) = 6\sqrt{5} > 0,\\ f''(x_2) = f''(-2 - \sqrt{5}) = -6\sqrt{5} < 0, \]
i.e. \(x_1 = -2 + \sqrt{5}\) is a local minimum, whereas \(x_2 = -2 - \sqrt{5}\) is a local maximum of \(f(x)\).
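A quick numerical check of these critical points, using only the derivatives stated above:

```python
# Verify that the roots found above satisfy f'(x) = 0 and classify them
# via the sign of f''(x), using the derivatives stated in the solution.
import numpy as np

fp = lambda x: 3 * x**2 + 12 * x - 3    # f'(x)
fpp = lambda x: 6 * x + 12              # f''(x)

x1, x2 = -2 + np.sqrt(5), -2 - np.sqrt(5)
print(fp(x1), fp(x2))    # both ~ 0
print(fpp(x1) > 0)       # True  -> local minimum at x1
print(fpp(x2) < 0)       # True  -> local maximum at x2
```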
The full batch update rule for gradient descent is \[ \mathbf{\theta}_{i+1} = \mathbf{\theta}_i - \gamma_i\sum_{n=1}^{N}\left(\nabla L_n(\mathbf{\theta}_i)\right)^T \]
For a mini-batch of size \(1\), at every step we would use only one of the data points, i.e. we would choose \(n\) uniformly at random from \(1,\dots,N\) and calculate the update as
\[ \mathbf{\theta}_{i+1} = \mathbf{\theta}_i - \gamma_i\left(\nabla L_n(\mathbf{\theta}_i)\right)^T \]
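A minimal sketch of this batch-size-1 (stochastic gradient descent) update, using hypothetical quadratic losses \(L_n(\theta) = \frac{1}{2}(\theta - x_n)^2\) purely for illustration:

```python
# A minimal sketch (hypothetical quadratic losses L_n) of the batch-size-1
# update: at each step a single index n is drawn at random and only its
# gradient is used.
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=3.0, size=100)          # toy data; L_n(theta) = (theta - x_n)^2 / 2

def grad_L_n(theta, n):
    return theta - data[n]                    # gradient of the n-th loss term

theta, gamma = 0.0, 0.05
for i in range(2000):
    n = rng.integers(len(data))               # pick one data point uniformly at random
    theta = theta - gamma * grad_L_n(theta, n)

print(theta, data.mean())                     # theta approaches the minimizer of sum_n L_n
```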
True. Indeed, any two points in the intersection of two convex sets can be connected by a straight line segment that belongs to each of the sets (since they are convex), and therefore also lies in the intersection.

False. Indeed, if the two sets are disjoint, their union need not contain the segment joining a point of one set to a point of the other.
False. Subtracting one set from the other may remove some of the points that belong to the segments connecting points of the remaining set.
True. If \(f(\mathbf{x})\) and \(g(\mathbf{x})\) are convex functions, then for any \(\mathbf{x}, \mathbf{y}\) and any \(\theta \in[0,1]\): \[ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y}),\\ g(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta g(\mathbf{x}) + (1-\theta)g(\mathbf{y}). \]
By adding these inequalities we immediately obtain the condition of convexity for their sum, \(h(\mathbf{x}) = f(\mathbf{x}) + g(\mathbf{x})\):
\begin{gather*} h(\theta \mathbf{x} + (1-\theta)\mathbf{y})=\\ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) + g(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y}) + \theta g(\mathbf{x}) + (1-\theta)g(\mathbf{y})=\\ \theta h(\mathbf{x}) + (1-\theta)h(\mathbf{y}) \end{gather*}
False. For example, the difference of \(f(x) = ax^2\) and \(g(x) = bx^2\) (with \(a, b > 0\)) is convex if \(a \geq b\), but concave otherwise.
False. For example, the product of the two convex functions \(f(x) = (x-a)^2\) and \(g(x) = (x+a)^2\), namely \(h(x) = (x-a)^2(x+a)^2\), has two minima and a local maximum at \(x=0\). In particular, taking \(x=a\), \(y=-a\) and \(\theta = 0.5\) gives
\[ h(\theta x + (1-\theta)y) = h(0) = a^4 > 0 = 0.5 h(a) + 0.5 h(-a) = \theta h(x) + (1-\theta)h(y), \]
i.e. the condition of convexity is not satisfied.
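The counterexample is easy to reproduce numerically (illustrative value of \(a\) only):

```python
# Check that h(x) = (x - a)^2 (x + a)^2 violates the midpoint convexity
# condition used above.
import numpy as np

a = 1.0
h = lambda x: (x - a) ** 2 * (x + a) ** 2

x, y, theta = a, -a, 0.5
lhs = h(theta * x + (1 - theta) * y)        # h(0) = a^4
rhs = theta * h(x) + (1 - theta) * h(y)     # 0
print(lhs, rhs, lhs <= rhs)                 # a^4, 0.0, False -> not convex
```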
The maximum of a function is not itself a function, so it does not have the property of convexity. If, however, we regard the maximum as a set consisting of a single point, then it is trivially convex.
We first introduce vectors \(\mathbf{y} = (x_0, x_1, \xi)^T\) and \(\mathbf{c} = (p_0, p_1, 1)^T\), so that \(\mathbf{p}^T\mathbf{x} + \xi = \mathbf{c}^T\mathbf{y}\).
We then introduce the vector \(\mathbf{b} = (0, 3, 0)^T\) and the matrix
\[ \mathbf{A} =\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & -1 \end{bmatrix}\]
which allow us to write the three constraints as \(\mathbf{A}\mathbf{y} \leq \mathbf{b}\).
We can now write the problem as a standard linear program:
\begin{gather*} \max_{\mathbf{y} \in \mathbb{R}^3} \mathbf{c}^T\mathbf{y}\\ \text{subject to } \mathbf{A}\mathbf{y} \leq \mathbf{b} \end{gather*}
Let us define \(\mathbf{c} = (-5, -3)^T\), \(\mathbf{b} = (33, 8, 5, -1, 8)^T\) and \[ \mathbf{A}=\begin{bmatrix} 2 & 2\\ 2 & -4\\ -2 & 1\\ 0 & -1\\ 0 & 1 \end{bmatrix}\]
The linear program is then written as
\begin{gather*} \min_{\mathbf{x} \in \mathbb{R}^2} \mathbf{c}^T\mathbf{x}\\ \text{subject to } \mathbf{A}\mathbf{x} \leq \mathbf{b} \end{gather*}
The Lagrangian of this problem is
\[ \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = \mathbf{c}^T\mathbf{x} + \mathbf{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b})= (\mathbf{c}^T + \mathbf{\lambda}^T\mathbf{A})\mathbf{x} - \mathbf{\lambda}^T\mathbf{b}= (\mathbf{c} + \mathbf{A}^T\mathbf{\lambda})^T\mathbf{x} - \mathbf{\lambda}^T\mathbf{b} \]
Taking the gradient with respect to \(\mathbf{x}\) and setting it to zero, we obtain the extremum condition
\[ \mathbf{c} + \mathbf{A}^T\mathbf{\lambda} = 0, \]
that is
\[ \mathcal{D}(\mathbf{\lambda}) = \min_{\mathbf{x} \in \mathbb{R}^2} \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = - \mathbf{\lambda}^T\mathbf{b} \]
That is, the dual problem is given by
\begin{gather*} \max_{\mathbf{\lambda} \in \mathbb{R}^5} - \mathbf{b}^T\mathbf{\lambda}\\ \text{subject to } \mathbf{c} + \mathbf{A}^T\mathbf{\lambda} = 0 \text{ and } \mathbf{\lambda} \geq 0 \end{gather*}
In terms of the original parameter values it can thus be written as
\[\max_{\mathbf{\lambda} \in \mathbb{R}^5} - \begin{bmatrix}33 \\ 8 \\ 5 \\ -1 \\ 8\end{bmatrix}^T\begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5\end{bmatrix}\\ \text{subject to } -\begin{bmatrix}5 \\ 3\end{bmatrix}+\begin{bmatrix}2 & 2 & -2 & 0 & 0\\ 2 & -4 & 1 & -1 & 1\end{bmatrix}\begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5 \end{bmatrix} = 0\\ \text{and } \begin{bmatrix}\lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \\ \lambda_5\end{bmatrix} \geq 0\]
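For a numerical cross-check, the primal linear program can be solved directly, e.g. with SciPy's linprog; the sketch below uses the parameter values defined above and leaves the variables unbounded to match the formulation.

```python
# Solve the primal linear program min c^T x subject to A x <= b numerically.
import numpy as np
from scipy.optimize import linprog

c = np.array([-5.0, -3.0])
A = np.array([[2, 2], [2, -4], [-2, 1], [0, -1], [0, 1]], dtype=float)
b = np.array([33, 8, 5, -1, 8], dtype=float)

# bounds=(None, None) removes linprog's default nonnegativity constraint on x
res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None), (None, None)])
print(res.x, res.fun)   # primal minimizer and optimal value
```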
We introduce \(\mathbf{Q} = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}\), \(\mathbf{c} = \begin{bmatrix} 5\\3\end{bmatrix}\), \[\mathbf{A} = \begin{bmatrix} 1 & 0\\ -1 & 0\\ 0 & 1\\ 0 & -1 \end{bmatrix}\] and \[\mathbf{b}=\begin{bmatrix}1\\1\\1\\1\end{bmatrix}\]
Then the quadratic program takes the form
\begin{gather*} \min_{\mathbf{x} \in \mathbb{R}^2} \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x}\\ \text{subject to } \mathbf{A}\mathbf{x} \leq \mathbf{b} \end{gather*}
The Lagrangian corresponding to this problem is
\[ \mathcal{L}(\mathbf{x},\mathbf{\lambda}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x} + \mathbf{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b})= \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + (\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{x} - \mathbf{\lambda}^T\mathbf{b} \]
where \(\mathbf{\lambda} = \begin{bmatrix}\lambda_1\\ \lambda_2\\ \lambda_3\\ \lambda_4\end{bmatrix}\). We minimize the Lagrangian by setting its gradient to zero, which results in
\[ \mathbf{Q}\mathbf{x} + \mathbf{c}+\mathbf{A}^T\mathbf{\lambda} = 0 \Rightarrow \mathbf{x} = -\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda}) \]
Substituting this back into the Lagrangian we obtain
\[ \mathcal{D}(\mathbf{\lambda})= \min_{\mathbf{x} \in \mathbb{R}^2} \mathcal{L}(\mathbf{x},\mathbf{\lambda}) =-\frac{1}{2}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})- \mathbf{\lambda}^T\mathbf{b} \]
The dual problem is now
\[ \max_{\mathbf{\lambda} \in \mathbb{R}^4} -\frac{1}{2}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})^T\mathbf{Q}^{-1}(\mathbf{c}+\mathbf{A}^T\mathbf{\lambda})- \mathbf{\lambda}^T\mathbf{b}\\ \text{subject to } \mathbf{\lambda} \geq 0, \] where the parameter vectors and matrices are defined above.
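As a numerical sanity check, the primal quadratic program can be solved with a general-purpose constrained optimizer; the sketch below uses SciPy's SLSQP method with the \(\mathbf{Q}\), \(\mathbf{c}\), \(\mathbf{A}\), \(\mathbf{b}\) defined above.

```python
# Solve the primal quadratic program min 1/2 x^T Q x + c^T x subject to A x <= b.
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 1.0], [1.0, 4.0]])
c = np.array([5.0, 3.0])
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
b = np.ones(4)

objective = lambda x: 0.5 * x @ Q @ x + c @ x
constraints = {"type": "ineq", "fun": lambda x: b - A @ x}   # encodes A x <= b

res = minimize(objective, x0=np.zeros(2), constraints=[constraints], method="SLSQP")
print(res.x, res.fun)   # constrained minimizer and optimal value
```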
The primal problem can be written in the standard form as
\begin{gather*} \min_{\mathbf{w}\in\mathbb{R}^D} \frac{1}{2}\mathbf{w}^T\mathbf{w}\\ \text{subject to } 1-\mathbf{x}^T\mathbf{w} \leq 0 \end{gather*}
The Lagrangian is then
\[ \mathcal{L}(\mathbf{w},\lambda)= \frac{1}{2}\mathbf{w}^T\mathbf{w} + \lambda(1-\mathbf{x}^T\mathbf{w}) \]
Taking the gradient with respect to \(\mathbf{w}\) and setting it to zero, we obtain the position of the minimum: \[ \mathbf{w} = \lambda \mathbf{x} \]
Thus, the dual Lagrangian is
\[\mathcal{D}(\lambda)= \min_{\mathbf{w}\in\mathbb{R}^D}\mathcal{L}(\mathbf{w},\lambda)=-\frac{\lambda^2}{2}\mathbf{x}^T\mathbf{x} +\lambda \]
Assuming the standard dot product, the convex conjugate is defined as \[ f^*(\mathbf{s}) = \sup_{\mathbf{x}\in\mathbb{R}^D} \mathbf{s}^T\mathbf{x} - f(\mathbf{x}) \]
Since the function \(f(\mathbf{x})\) is continuous, differentiable, and convex, finding the supremum is equivalent to finding the maximum, which can be done by setting the following gradient to zero:
\[ \nabla_{\mathbf{x}}\left(\mathbf{s}^T\mathbf{x} - f(\mathbf{x})\right) = \mathbf{s}^T - \nabla_\mathbf{x}f(\mathbf{x})=0 \]
that is \[ s_d = \frac{\partial}{\partial x_d}f(\mathbf{x}) = \log x_d + 1 \Rightarrow x_d = e^{s_d - 1} \]
Then
\[ f^*(\mathbf{s})= \sum_{d=1}^D s_d x_d - \sum_{d=1}^D x_d\log x_d= \sum_{d=1}^D s_d e^{s_d-1} - \sum_{d=1}^D e^{s_d-1}(s_d - 1)=\sum_{d=1}^D e^{s_d-1} \]
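The result can be verified numerically by maximizing \(\mathbf{s}^T\mathbf{x} - \sum_d x_d \log x_d\) directly; the sketch below uses an arbitrary \(\mathbf{s}\) and a generic optimizer.

```python
# Numerical check: sup_x [ s.x - sum_d x_d log x_d ] should equal sum_d exp(s_d - 1).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
D = 3
s = rng.normal(size=D)

neg_obj = lambda x: -(s @ x - np.sum(x * np.log(x)))          # minimize the negative
res = minimize(neg_obj, x0=np.ones(D), bounds=[(1e-9, None)] * D)

print(-res.fun, np.sum(np.exp(s - 1)))   # the two values should agree
```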
Without loss of generality we can assume that \(\mathbf{A} = \mathbf{A}^T\) is a symmetric matrix (and hence so is its inverse). We set the gradient to zero, \[ \nabla_{\mathbf{x}}\left(\mathbf{s}^T\mathbf{x} - f(\mathbf{x})\right)= \mathbf{s}^T - \mathbf{x}^T\mathbf{A} - \mathbf{b}^T= 0, \]
that is
\(\mathbf{s} = \mathbf{A}\mathbf{x} + \mathbf{b} \Leftrightarrow \mathbf{x} = \mathbf{A}^{-1}(\mathbf{s} - \mathbf{b})\)
Therefore
\[ f^*(\mathbf{s}) = \frac{1}{2}(\mathbf{s} - \mathbf{b})^T\mathbf{A}^{-1}(\mathbf{s} - \mathbf{b}) - c \]
By definition of the convex conjugate: \[L^*(\beta) = \sup_{\alpha \in \mathbb{R}} \left(\beta\alpha - \max\{0, 1-\alpha\}\right)=\sup_{\alpha \in \mathbb{R}}\begin{cases}\beta\alpha, \text{ if } \alpha \geq 1\\ \beta\alpha - 1 + \alpha, \text{ if } \alpha< 1 \end{cases}=\sup_{\alpha \in \mathbb{R}}\begin{cases}\beta\alpha, \text{ if } \alpha \geq 1\\ (\beta+1)\alpha - 1, \text{ if } \alpha< 1\end{cases}\]
Here we must distinguish three cases:
a) if \(\beta>0\), then \(\beta\alpha\) and \((\beta + 1)\alpha -1\) are both increasing in \(\alpha\), and each branch takes its supremum at the largest admissible value of \(\alpha\), that is \[L^*(\beta) = \max\left\{\sup_{\alpha \geq 1}\beta\alpha,\ \sup_{\alpha < 1}\left[(\beta+1)\alpha - 1\right]\right\} = \max\{+\infty,\ \beta\}=+\infty\]
b) if \(\beta < -1\), then \(\beta\alpha\) and \((\beta + 1)\alpha -1\) are both decreasing in \(\alpha\), and each branch takes its supremum at the smallest admissible value of \(\alpha\), that is \[L^*(\beta) = \max\left\{\sup_{\alpha \geq 1}\beta\alpha,\ \sup_{\alpha < 1}\left[(\beta+1)\alpha - 1\right]\right\} = \max\{\beta,\ +\infty\}=+\infty\]
c) Finally, if \(-1 \leq \beta \leq 0\), then \(\beta\alpha\) is non-increasing in \(\alpha\), whereas \((\beta + 1)\alpha -1\) is non-decreasing, so the supremum of each branch equals \(\beta\) (attained at \(\alpha = 1\) in the first branch and approached as \(\alpha \to 1\) in the second), i.e. \[L^*(\beta) = \max\left\{\sup_{\alpha \geq 1}\beta\alpha,\ \sup_{\alpha < 1}\left[(\beta+1)\alpha - 1\right]\right\} = \max\{\beta,\ \beta\}=\beta\]
We thus obtain
\[L^*(\beta) =\begin{cases} \beta, \text{ if } -1\leq \beta \leq 0\\ +\infty, \text{ otherwise} \end{cases}\]
Let us now define the function \(M(\beta) = L^*(\beta) + \frac{\gamma}{2}\beta^2\) and calculate its convex conjugate (we assume that \(\gamma> 0\)):
\[ M^*(\alpha)= \sup_{\beta \in \mathbb{R}} \alpha\beta - L^*(\beta) - \frac{\gamma}{2}\beta^2= \sup_{\beta \in [-1, 0]} \alpha\beta - L^*(\beta) - \frac{\gamma}{2}\beta^2= \sup_{\beta \in [-1, 0]} \alpha\beta - \beta - \frac{\gamma}{2}\beta^2, \]
where the second equality holds because outside of the interval \([-1,0]\) we have \(-L^*(\beta) = -\infty\).
On the interval \([-1,0]\) the parabolic function \(\alpha\beta - \beta - \frac{\gamma}{2}\beta^2 = - \frac{\gamma}{2}\left(\beta-\frac{\alpha-1}{\gamma}\right)^2 +\frac{(1-\alpha)^2}{2\gamma}\) has a maximum at \(\beta = \frac{\alpha-1}{\gamma}\) if this point lies inside the interval, and at one of the interval endpoints otherwise. More precisely,
\[M^*(\alpha)=\begin{cases} 0, \text{ if } \alpha \geq 1\\ \frac{(1-\alpha)^2}{2\gamma}, \text{ if } 1-\gamma\leq\alpha\leq 1\\ 1-\alpha-\frac{\gamma}{2}, \text{ if } \alpha < 1-\gamma \end{cases}\]
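The piecewise expression for \(M^*(\alpha)\) can be checked against a brute-force maximization over \(\beta \in [-1, 0]\); the sketch below uses arbitrary values of \(\gamma\) and \(\alpha\).

```python
# Check the piecewise formula for M*(alpha) by maximizing
# alpha*beta - beta - (gamma/2)*beta^2 over beta in [-1, 0] on a fine grid.
import numpy as np

gamma = 0.5
betas = np.linspace(-1.0, 0.0, 200_001)

def M_star_numeric(alpha):
    return np.max(alpha * betas - betas - 0.5 * gamma * betas**2)

def M_star_formula(alpha):
    if alpha >= 1:
        return 0.0
    if alpha >= 1 - gamma:
        return (1 - alpha) ** 2 / (2 * gamma)
    return 1 - alpha - gamma / 2

for alpha in (-1.0, 0.2, 0.75, 1.5):
    assert np.isclose(M_star_numeric(alpha), M_star_formula(alpha), atol=1e-6)
```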