\item \points{30} {\bf Incomplete, Positive-Only Labels}

In this problem we will consider training binary classifiers in situations where we do not have full access to the labels. In particular, we consider a scenario, which is not uncommon in real life, where we have labels only for a subset of the positive examples. All the negative examples and the rest of the positive examples are unlabeled.

We formalize the scenario as follows. Let $\{(x^{(i)}, t^{(i)})\}_{i=1}^\nexp$ be a standard dataset of i.i.d.\ examples. Here the $x^{(i)}$'s are the inputs/features and the $t^{(i)}$'s are the labels. Now consider the situation where the $t^{(i)}$'s are not observed by us. Instead, we only observe the labels of some of the positive examples. Concretely, we assume that we observe $y^{(i)}$'s that are generated by
\begin{align*}
& \forall x, ~~ p(y^{(i)} = 1 \mid t^{(i)} = 1, x^{(i)} = x) = \alpha, \\
& \forall x, ~~ p(y^{(i)} = 0 \mid t^{(i)} = 1, x^{(i)} = x) = 1 - \alpha, \\
& \forall x, ~~ p(y^{(i)} = 1 \mid t^{(i)} = 0, x^{(i)} = x) = 0, \\
& \forall x, ~~ p(y^{(i)} = 0 \mid t^{(i)} = 0, x^{(i)} = x) = 1,
\end{align*}
where $\alpha \in (0, 1)$ is some unknown scalar. In other words, if the unobserved ``true'' label $t^{(i)}$ is 1, then with probability $\alpha$ we observe a label $y^{(i)} = 1$. On the other hand, if the unobserved ``true'' label is $t^{(i)} = 0$, then we always observe the label $y^{(i)} = 0$.

Our final goal in this problem is to construct a binary classifier $h$ for the true label $t$, with access only to the partial labels $y$. In other words, we want to construct $h$ such that $h(x^{(i)}) \approx p(t^{(i)} = 1 \mid x^{(i)})$ as closely as possible, using only $x$ and $y$.

\emph{Real world example: Suppose we maintain a database of proteins which are involved in transmitting signals across membranes. Every example added to the database is involved in a signaling process, but there are many proteins involved in cross-membrane signaling which are missing from the database. It would be useful to train a classifier to identify proteins that should be added to the database. In our notation, each example $x^{(i)}$ corresponds to a protein, $y^{(i)} = 1$ if the protein is in the database and $0$ otherwise, and $t^{(i)} = 1$ if the protein is involved in a cross-membrane signaling process and thus should be added to the database, and $0$ otherwise.}

For the rest of the question, we will use the dataset and starter code provided in the following files:
%
\begin{center}
\begin{itemize}
  \item \url{src/posonly/{train,valid,test}.csv}
  \item \url{src/posonly/posonly.py}
\end{itemize}
\end{center}
%
Each file contains the following columns: $x_1$, $x_2$, $y$, and $t$. As in Problem 1, there is one example per row. The $y^{(i)}$'s are generated from the process defined above with some unknown $\alpha$.

\begin{enumerate}
  \input{posonly/01-train-t-labels}
  \ifnum\solutions=1 {
    \input{posonly/01-train-t-labels-sol}
  } \fi
  \input{posonly/02-train-y-labels}
  \ifnum\solutions=1 {
    \input{posonly/02-train-y-labels-sol}
  } \fi
  \input{posonly/03-bayes-warm-up}
  \ifnum\solutions=1 {
    \input{posonly/03-bayes-warm-up-sol}
  } \fi
  \input{posonly/04-constant}
  \ifnum\solutions=1 {
    \input{posonly/04-constant-sol}
  } \fi
  \input{posonly/05-estimate-alpha}
  \ifnum\solutions=1 {
    \input{posonly/05-estimate-alpha-sol}
  } \fi
  \input{posonly/06-plot}
  \ifnum\solutions=1 {
    \input{posonly/06-plot-sol}
  } \fi
\end{enumerate}
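As an aside, the observation model above is straightforward to simulate. Below is a minimal sketch in Python (matching the language of the starter code); it is \emph{not} part of the provided files, and the function name, seed, and the value of $\alpha$ are purely illustrative.
%
\begin{verbatim}
import numpy as np

def generate_observed_labels(t, alpha, seed=None):
    """Simulate the positive-only observation process.

    t:     array of true labels in {0, 1}
    alpha: probability that a positive example receives y = 1
    """
    rng = np.random.default_rng(seed)
    # y can be 1 only when t = 1, and then only with probability
    # alpha; every t = 0 example is always observed as y = 0.
    return t * (rng.random(t.shape) < alpha).astype(int)

# Example usage: with alpha = 0.2, roughly 20% of the positive
# examples carry an observed label y = 1.
t = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y = generate_observed_labels(t, alpha=0.2, seed=0)
\end{verbatim}
%
This is only meant to make the generative assumptions concrete; the provided CSV files already contain a $y$ column generated by this process with some unknown $\alpha$.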
\textbf{Remark}: We saw that the true probability $p(t \mid x)$ is only a constant factor away from $p(y \mid x)$. This means that if our task is only to rank examples (\emph{i.e.}, sort them) in a particular order (e.g., sort the proteins from most to least likely to be involved in transmitting signals across membranes), then in fact we do not even need to estimate $\alpha$: the ranking based on $p(y \mid x)$ will agree with the ranking based on $p(t \mid x)$.
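For completeness, here is a short sketch of why the ranking is preserved, using only the observation model defined at the start of the problem. Conditioning on the true label,
\begin{align*}
p(y^{(i)} = 1 \mid x^{(i)}) &= \sum_{t' \in \{0, 1\}} p(y^{(i)} = 1 \mid t^{(i)} = t', x^{(i)}) \, p(t^{(i)} = t' \mid x^{(i)}) \\
&= \alpha \cdot p(t^{(i)} = 1 \mid x^{(i)}).
\end{align*}
Since $\alpha > 0$ is a fixed constant, sorting examples by $p(y = 1 \mid x)$ and sorting them by $p(t = 1 \mid x)$ produce exactly the same order.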