\item \points{30} {\bf Incomplete, Positive-Only Labels}

In this problem we will consider training binary classifiers in situations where we do not have full access to the labels. In particular, we consider a scenario, which is not uncommon in real life, where we have labels only for a subset of the positive examples. All the negative examples and the rest of the positive examples are unlabeled.

We formalize the scenario as follows. Let $\{(x^{(i)}, t^{(i)})\}_{i=1}^\nexp$ be a standard dataset of i.i.d.\ examples. Here the $x^{(i)}$'s are the inputs/features and the $t^{(i)}$'s are the labels. Now consider the situation where the $t^{(i)}$'s are not observed by us. Instead, we only observe the labels of some of the positive examples. Concretely, we assume that we observe $y^{(i)}$'s that are generated by
\begin{align*}
& \forall x, ~~ p(y^{(i)} = 1 \mid t^{(i)} = 1, x^{(i)} = x) = \alpha, \\
& \forall x, ~~ p(y^{(i)} = 0 \mid t^{(i)} = 1, x^{(i)} = x) = 1 - \alpha, \\
& \forall x, ~~ p(y^{(i)} = 1 \mid t^{(i)} = 0, x^{(i)} = x) = 0, \\
& \forall x, ~~ p(y^{(i)} = 0 \mid t^{(i)} = 0, x^{(i)} = x) = 1,
\end{align*}
where $\alpha \in (0, 1)$ is some unknown scalar. In other words, if the unobserved ``true'' label $t^{(i)}$ is 1, then with probability $\alpha$ we observe a label $y^{(i)} = 1$. On the other hand, if the unobserved ``true'' label is $t^{(i)} = 0$, then we always observe the label $y^{(i)} = 0$.

Our final goal in this problem is to construct a binary classifier $h$ for the true label $t$, with access only to the partial labels $y$. In other words, we want to construct $h$ such that $h(x^{(i)}) \approx p(t^{(i)} = 1 \mid x^{(i)})$ as closely as possible, using only $x$ and $y$.

\emph{Real world example: Suppose we maintain a database of proteins which are involved in transmitting signals across membranes. Every example added to the database is involved in a signaling process, but there are many proteins involved in cross-membrane signaling which are missing from the database. It would be useful to train a classifier to identify proteins that should be added to the database. In our notation, each example $x^{(i)}$ corresponds to a protein, $y^{(i)} = 1$ if the protein is in the database and $0$ otherwise, and $t^{(i)} = 1$ if the protein is involved in a cross-membrane signaling process and thus should be added to the database, and $0$ otherwise.}

For the rest of the question, we will use the dataset and starter code provided in the following files:
%
\begin{center}
\begin{itemize}
  \item \url{src/posonly/{train,valid,test}.csv}
  \item \url{src/posonly/posonly.py}
\end{itemize}
\end{center}
%
Each file contains the following columns: $x_1$, $x_2$, $y$, and $t$. As in Problem 1, there is one example per row. The $y^{(i)}$'s are generated from the process defined above with some unknown $\alpha$.

\begin{enumerate}
  \input{posonly/01-train-t-labels}
  \ifnum\solutions=1 {
    \input{posonly/01-train-t-labels-sol}
  } \fi
  \input{posonly/02-train-y-labels}
  \ifnum\solutions=1 {
    \input{posonly/02-train-y-labels-sol}
  } \fi
  \input{posonly/03-bayes-warm-up}
  \ifnum\solutions=1 {
    \input{posonly/03-bayes-warm-up-sol}
  } \fi
  \input{posonly/04-constant}
  \ifnum\solutions=1 {
    \input{posonly/04-constant-sol}
  } \fi
  \input{posonly/05-estimate-alpha}
  \ifnum\solutions=1 {
    \input{posonly/05-estimate-alpha-sol}
  } \fi
  \input{posonly/06-plot}
  \ifnum\solutions=1 {
    \input{posonly/06-plot-sol}
  } \fi
\end{enumerate}
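As an aside, the observation model above is straightforward to simulate. Below is a minimal sketch in Python (matching the language of the starter code); it is \emph{not} part of the provided files, and the function name, seed, and the value of $\alpha$ are purely illustrative.
%
\begin{verbatim}
import numpy as np

def generate_observed_labels(t, alpha, seed=None):
    """Simulate the positive-only observation process.

    t:     array of true labels in {0, 1}
    alpha: probability that a positive example receives y = 1
    """
    rng = np.random.default_rng(seed)
    # y can be 1 only when t = 1, and then only with probability
    # alpha; every t = 0 example is always observed as y = 0.
    return t * (rng.random(t.shape) < alpha).astype(int)

# Example usage: with alpha = 0.2, roughly 20% of the positive
# examples carry an observed label y = 1.
t = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y = generate_observed_labels(t, alpha=0.2, seed=0)
\end{verbatim}
%
This is only meant to make the generative assumptions concrete; the provided CSV files already contain a $y$ column generated by this process with some unknown $\alpha$.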
\textbf{Remark}: We saw that the true probability $p(t \mid x)$ is only a constant factor away from $p(y \mid x)$. This means that if our task is only to rank examples (\emph{i.e.}, sort them) in a particular order (e.g., sort the proteins from most to least likely to be involved in transmitting signals across membranes), then in fact we do not even need to estimate $\alpha$: the ranking based on $p(y \mid x)$ will agree with the ranking based on $p(t \mid x)$.
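For completeness, here is a short sketch of why the ranking is preserved, using only the observation model defined at the start of the problem. Conditioning on the true label,
\begin{align*}
p(y^{(i)} = 1 \mid x^{(i)}) &= \sum_{t' \in \{0, 1\}} p(y^{(i)} = 1 \mid t^{(i)} = t', x^{(i)}) \, p(t^{(i)} = t' \mid x^{(i)}) \\
&= \alpha \cdot p(t^{(i)} = 1 \mid x^{(i)}).
\end{align*}
Since $\alpha > 0$ is a fixed constant, sorting examples by $p(y = 1 \mid x)$ and sorting them by $p(t = 1 \mid x)$ produce exactly the same order.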