Linear Regression
This page covers closed-form and iterative approximation methods for solving the Linear Regression problem.
The derivation proceeds by assuming a normal distribution on $Y$ given $X = x$, so that the model targets the conditional expectation $\mathbb{E}[Y \mid X = x]$.
Closed Form of OLS
Taking the gradient of the objective in $$\hat{\beta} = \arg \min_{\beta} \frac{1}{n}\|y-X \beta\|^2_2$$ with respect to $\beta$ and setting it to zero gives the normal equations and, assuming $X^TX$ is invertible, the closed-form solution: $$\begin{align*}\nabla_\beta \frac{1}{n}\|y-X\beta\|^2_2 &= -\frac{2}{n}X^T(y - X\beta) = 0\\ \implies X^TX\hat{\beta} &= X^Ty\\ \implies \hat{\beta} &= (X^TX)^{-1}X^Ty\end{align*}$$
Example: Closed-Form Fit
import numpy as np
# synthetic data for the rest of the linear models:
np.random.seed(5)
n = 100 # samples
p = 5 # features
sigma = 0.2 # std
X = np.random.normal(0, 1, size=(n,p))
beta_true = np.random.randint(-4, 2, p)
noise = np.random.normal(0, sigma, size=(n))
y = X @ beta_true + noise
betahat = np.linalg.inv(X.T @ X) @ X.T @ y  # closed-form OLS estimate: (X^T X)^{-1} X^T y
print("betahat: ", betahat)
print("beta true:", beta_true)
betahat:  [-2.94946726 0.01589149 -2.004408 -3.97428268 -3.99637663]
beta true: [-3 0 -2 -4 -4]
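As a side note, explicitly inverting $X^TX$ (as above) can be numerically unstable when $X^TX$ is ill-conditioned. Here is a minimal sketch of a more stable alternative, assuming the same X and y from the example above, using NumPy's built-in least-squares solver:
# solve min ||y - X beta||_2 directly, without forming (X^T X)^{-1}
betahat_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("betahat (lstsq):", betahat_lstsq)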
Iterative Approach
An idea that permeates all of Machine Learning is that sometimes you cannot explicitly solve the `argmin` formulation for the parameters in closed form, and other times a closed form exists but is computationally infeasible. In those cases we fall back on iterative optimization.
Gradient Descent
Stochastic Gradient Descent:
$$\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} - \eta\nabla_\beta L_{i_k}(\hat{\beta}^{(k)}), \qquad L_i(\beta) = (y_i - x_i^T\beta)^2$$
in which each update uses only a single, randomly chosen sample $i_k$, and,
Batch Gradient Descent:
$$\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} - \eta\nabla_\beta L(\hat{\beta}^{(k)}), \qquad L(\beta) = \frac{1}{n}\|y-X\beta\|^2_2, \quad \nabla_\beta L(\beta) = -\frac{2}{n}X^T(y - X\beta)$$
in which each update uses the gradient over the entire dataset (both variants are sketched in code below).
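Below is a minimal sketch of both variants for the least-squares loss, assuming the X, y, n, p from the synthetic example above; the step sizes and iteration counts are hand-picked for illustration, not tuned values.
# full-batch gradient descent on L(beta) = (1/n) * ||y - X beta||^2
eta = 0.1                                        # step size (hand-picked)
beta_gd = np.zeros(p)
for k in range(500):
    grad = -(2 / n) * X.T @ (y - X @ beta_gd)    # gradient over the whole dataset
    beta_gd = beta_gd - eta * grad
print("beta (batch GD):", beta_gd)

# stochastic gradient descent: one randomly chosen sample per update
rng = np.random.default_rng(0)
eta_sgd = 0.01                                   # smaller step, single-sample gradients are noisy
beta_sgd = np.zeros(p)
for k in range(5000):
    i = rng.integers(n)                          # pick one sample index at random
    grad_i = -2 * (y[i] - X[i] @ beta_sgd) * X[i]  # gradient of (y_i - x_i^T beta)^2
    beta_sgd = beta_sgd - eta_sgd * grad_i
print("beta (SGD):", beta_sgd)
With a constant step size, SGD only settles into a neighborhood of the solution; decaying $\eta$ over the iterations (not shown) tightens the final estimate.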
Note, in practice you will usually choose mini-batch Gradient Descent, which is a compromise between these two approaches: $$\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} - \frac{\eta}{|B_k|}\sum_{i \in B_k}\nabla_\beta L_i(\hat{\beta}^{(k)})$$ where $B_k$ is a small, randomly sampled subset (mini-batch) of the data.
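A minimal sketch of mini-batch gradient descent, again assuming the X, y, n, p from the example above; the batch size B and step size are hypothetical choices (B = 1 recovers SGD, B = n recovers full-batch gradient descent).
rng = np.random.default_rng(1)
B = 16                                           # mini-batch size (hypothetical choice)
eta_mb = 0.05                                    # step size (hand-picked)
beta_mb = np.zeros(p)
for k in range(2000):
    idx = rng.choice(n, size=B, replace=False)   # sample a mini-batch of indices
    Xb, yb = X[idx], y[idx]
    grad = -(2 / B) * Xb.T @ (yb - Xb @ beta_mb) # gradient on the mini-batch only
    beta_mb = beta_mb - eta_mb * grad
print("beta (mini-batch GD):", beta_mb)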