Perceptron

Origins

The perceptron learning algorithm is the most simple algorithm we have for Binary Classification.

It was introduced by Frank Rosenblatt in his seminal paper: "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" in 1958. The history however dates back further to the theoretical foundations of Warren McCulloch and Walter Pitts in 1943 and their paper "A Logical Calculus of the Ideas Immanent in Nervous Activity". The interested reader may visit these links for annotations and the original pdfs.

The proposed algorithm guarantees us a binary separation of the data provided that it is linearly separable. This is a big assumption, but if it the nature of our data obeys this rule then this model will fit.

Marvin Minsky and Seymour Papert also reaffirmed the author's original convergence proofs in their book "Perceptrons", but snuffed out the method's momentum by demonstrating its inability to solve the XOR and thus potentially even more difficult problems. We would later learn that XOR could be solved with a multi-layered perceptron.

We can see in the diagram below what Rosenblatt's theoretical perceptron looked like, how it behaved and how it resembled our own human biological neurons:

Basically, a number of neurons would connect into our neuron, and if the summationed value was greater than some threshold, that neuron in question would fire. Mathematically we can see that above with the equation \(w_1x_1 + w_2x_2 + ... + w_nx_n > w_0\). If this is true then the step-function returns 1, otherwise -1.

Here is a diagram showing how the synapses of one neuron connect to the dendrites of another. Then if enough dendrites send charges through the synaptic terminals of another neuron, then that neuron will fire!

The Algorithm

Explanation

This algorithm is just a tersely condensed version of the prose:

Keep updating your weights¹ according to the following rule until all your examples are correctly classified:

\(w_{t} = w_{t-1} + \eta y_{t-1} \mathbf{x_{t-1}}\), where \(t\) is time, \(\eta\) is your learning hyperparameter, \(y_{t-1}\) is the (incorrect) outcome of the last time step, and the data vector \(\mathbf{x_{t-1}}\) is the corresponding inputs of that same previous time step.

Example 1

We can step through a small example manually too:

Given the following data, we wish to create a classifier that successfully separates the data. We can then do our own inferences when given a new point such as \((5,2)\) for example.

example	x1	x2	class
a.	0	1	-1
b.	2	0	-1
c.	1	1	+1

Now of course a purely geometric solution can be constructed for this, and we can spin one up in a moment to match our arithmetic with; but first let us update the weights manually and see how many iterations of this algorithm are required to find such a solution.

  import numpy as np

  #data; note the 1's forming the bias component of our model
  a = np.array([1,0,1,-1])
  b = np.array([1,2,0,-1])
  c = np.array([1,1,1,+1])
  X = np.array([a,b,c]) # design matrix
  eta = 1.0

  #w = np.array([-1.5, 0, 2])
  w = np.zeros(3)

  epoch = 0
  errors = len(w)
  while errors > 0:
    print(f"Epoch: {epoch}")
    errors = len(w)
    for x in X:
      # calculate s
      s = x[0] * w[0] + x[1] * w[1] + x[2] * w[2]
      if (s * x[3] > 0): # checks that sign is the same.
        errors -= 1
      else: # updates weight vector.
        #w = w + eta * x[:3] * x[3]
        if x[3] > 0:
          w = np.array([j+eta*x[i] for i,j in enumerate(w)])
        else:
          w = np.array([j-eta*x[i] for i,j in enumerate(w)])
        print(w)
    epoch = epoch + 1;

  Epoch: 0
  [-1.  0. -1.]
  [0. 1. 0.]
  Epoch: 1
  [-1.  1. -1.]
  [-2. -1. -1.]
  [-1.  0.  0.]
  Epoch: 2
  [0. 1. 1.]
  Epoch: 3
  [-1.  1.  0.]
  [-2. -1.  0.]
  [-1.  0.  1.]
  Epoch: 4
  [-2.  0.  0.]
  [-1.  1.  1.]
  Epoch: 5
  [-2.  1.  0.]
  [-3. -1.  0.]
  [-2.  0.  1.]
  Epoch: 6
  [-1.  1.  2.]
  Epoch: 7
  [-2.  1.  1.]
  [-3. -1.  1.]
  [-2.  0.  2.]
  Epoch: 8
  [-3.  0.  1.]
  [-2.  1.  2.]
  Epoch: 9
  [-3.  1.  1.]
  [-2.  2.  2.]
  Epoch: 10
  [-3.  2.  1.]
  [-4.  0.  1.]
  [-3.  1.  2.]
  Epoch: 11
  [-2.  2.  3.]
  Epoch: 12
  [-3.  2.  2.]
  [-4.  0.  2.]
  [-3.  1.  3.]
  Epoch: 13
  [-4.  1.  2.]
  [-3.  2.  3.]
  Epoch: 14
  [-4.  2.  2.]
  [-5.  0.  2.]
  [-4.  1.  3.]
  Epoch: 15
  [-3.  2.  4.]
  Epoch: 16
  [-4.  2.  3.]
  [-5.  0.  3.]
  [-4.  1.  4.]
  Epoch: 17
  [-5.  1.  3.]
  [-4.  2.  4.]
  Epoch: 18
  [-5.  2.  3.]
  [-4.  3.  4.]
  Epoch: 19
  [-5.  3.  3.]
  [-6.  1.  3.]
  [-5.  2.  4.]
  Epoch: 20

We see that the algorithm took 20 epochs (loops through the whole dataset - arrays a, b and c) to converge to the correct² solution weights. You will notice first that not every epoch has the same number of weight updates, and you will also notice that this algorithm differs from the one outlined in pseudocode. I did this because this code is more verbose and clear to me. In a moment though we shall go and refactor the code with vectorised operations in favour of naïve list comprehensions.

For now let us choose a better weight initialisation vector and see how that affects our epoch count:

  w = np.array([-1.5, 0, 2])

  Epoch: 0
  [-2.5  0.   1. ]
  [-1.5  1.   2. ]
  Epoch: 1
  [-2.5  1.   1. ]
  [-1.5  2.   2. ]
  Epoch: 2
  [-2.5  2.   1. ]
  [-3.5  0.   1. ]
  [-2.5  1.   2. ]
  Epoch: 3

How fascinating. A reduction by a factor of 6 as a reward for a small amount of a priori information!

Visuals: Matching our Euclidean Intuition

Now let us chalk up the plot and see if our algorithm's final weights produce the same decision boundary as a human would.

  import matplotlib.pyplot as plt
  %matplotlib inline

  x1_neg = X[X[:,3] == -1][:,1]
  x2_neg = X[X[:,3] == -1][:,2]
  x1_pos = X[X[:,3] == +1][:,1]
  x2_pos = X[X[:,3] == +1][:,2]

  x_vals = np.linspace(-1,3,100)
  x2 = -(w[0]+w[1]*x_vals) / w[2]
  # ^derived from rearranging w0 + w1x1 + w2x2 = 0 in terms of x2

  plt.figure(figsize=(8,6))
  plt.scatter(x1_neg, x2_neg, color='red', label='negative class')
  plt.scatter(x1_pos, x2_pos, color='blue', label='positive class')

  plt.plot(x_vals, x2, color='green', label='decision boundary')
  plt.plot(x_vals, -1/2*x_vals+1.5, color='grey', linestyle='--', label='upper')
  plt.plot(x_vals, -1/2*x_vals+1.0, color='grey', linestyle='--', label='lower')
  plt.axvline(0, color='black')
  plt.axhline(0, color='black')
  plt.xlabel('x1')
  plt.ylabel('x2')
  plt.legend()
  plt.title("Perceptron on Euclidean Plane")
  plt.grid()
  plt.show()

Interpretting this, we see that indeed if we were draw two slopes ourselves (upper and lower), and then slice that in half we would get exactly the decision boundary that the perceptron found. We will now run the same code on a slightly more complicated example to see that this is not always true.

Example 2: More \(X_i\)'s; still 2D

example	x1	x2	class
a.	-2	-1	-1
b.	2	-1	+1
c.	1	1	+1
d.	-1	-1	-1
e.	3	2	1

Method Extraction:

At this point, we keep reusing the same code and so let us refactor the rogue perceptron code into a more disciplined class:

  class Perceptron:
    def __init__(self, eta=1.0, max_iter=100):
      self.eta = eta
      self.max_iter = max_iter
      self.weights = None

    def fit(self, X):
      #self.weights = np.zeros(X.shape[1]-1) #initialise weight to 0's
      #self.weights = np.array([-1.5,0,2])
      self.weights = np.array([5.0,1.0,1.0])
      num_samples = X.shape[0]
      iteration = 0

      while iteration < self.max_iter:
	errors = 0
	for sample in X:
          bias, x1, x2, y = sample
          s = np.dot(self.weights, [bias, x1, x2])

          if s * y <= 0:
            errors += 1
            update = self.eta * y * np.array([bias, x1, x2])
            self.weights += update

	print(f"Epoch {iteration}: Weights={self.weights}")
	if errors == 0:
          break #converged!
	iteration += 1
      if iteration == self.max_iter:
	print("Reached maximum iterations without convergence.")


    def predict(self, X):
      if self.weights is None:
	raise ValueError("Model not trained yet. Call fit method first!")
      X_with_bias = np.hstack((np.ones((X.shape[0],1)),X))
      return np.sign(X_with_bias @ self.weights)

We can quickly sanity test on our inputs from our last perceptron:

  p = Perceptron()
  p.fit(X)
  print(f"weights: {p.weights}")

Epoch 0: Weights=[-0.5  2.   1. ]
Epoch 1: Weights=[-0.5  2.   1. ]
weights: [-0.5  2.   1. ]

Changes

In refactoring our code we have also made some upgrades:

switched to using a dot product
error checking
inverted the logic to increment errors and update weights only on that if branch
migrated to measuring by epochs: 1 iteration over all of the examples
made our code more reusable.

Solving Table 2:

  a = np.array([1, -2, -1, -1])
  b = np.array([1,  2, -1, +1])
  c = np.array([1,  1,  1, +1])
  d = np.array([1, -1, -1, -1])
  e = np.array([1,  3,  2, +1])
  big_X = np.array([a,b,c,d,e])
  big_p = Perceptron()
  big_p.fit(big_X)
  print(f"weights: {big_p.weights}")

Epoch 0: Weights=[4. 3. 2.]
Epoch 1: Weights=[4. 3. 2.]
weights: [4. 3. 2.]

Discrepancies:

Observe now that a different learning rate \(\eta\) yields us a different line:

  big_p_new_eta = Perceptron(eta=0.4)
  big_p_new_eta.fit(big_X)
  print(f"weights: {big_p_new_eta.weights}")

Epoch 0: Weights=[4.2 2.2 1.8]
Epoch 1: Weights=[3.8 2.6 2.2]
Epoch 2: Weights=[3.8 2.6 2.2]
weights: [3.8 2.6 2.2]

Plots

Ultimately we have multiple, imperfect solutions to the same problem.³

Let us add another method to our existing Perceptron class by leveraging some OOP:

  class PerceptronWithPlot(Perceptron):
    def plot_decision_boundary(self, X):
      if self.weights is None:
	raise ValueError("Model has not been trained. Call the fit method first!")

      # extracting range for plot.
      x_min, x_max = np.min(X[:, 1]), np.max(X[:, 1])
      y_min, y_max = np.min(X[:, 2]), np.max(X[:, 2])

      x_vals = np.linspace(x_min, x_max, 100)
      y_vals = -(self.weights[0] + self.weights[1] * x_vals) / self.weights[2]

      plt.figure(figsize=(8,6))
      for sample in X:
	bias, x1, x2, y = sample
	plt.scatter(x1,x2,c='red' if y == -1 else 'blue', s = 100)

      plt.plot(x_vals, y_vals, 'k--', label="Decision Boundary")
      plt.xlabel("x1")
      plt.ylabel("x2")
      plt.grid()
      plt.legend()
      plt.show()

  model_eta10 = PerceptronWithPlot(eta=1.0)
  model_eta01 = PerceptronWithPlot(eta=0.1)
  model_eta10.fit(big_X)
  model_eta01.fit(big_X)
  model_eta10.plot_decision_boundary(X)
  model_eta01.plot_decision_boundary(X)

Clearly we can see the difference between the different choices of hyperparameters for this algorithm: the initial weight vector, as well as the eta learning rate.

Iterative Eye Candy

Let us put the nail in this page's coffin and do justice to the iterative nature of this algorithm:

  import matplotlib.pyplot as plt
  def generate_data(n=5, means=[[3,3],[-1,1]], seed=1):
    np.random.seed(seed)
    m1=np.array(means[0])
    m2=np.array(means[1])
    S1 = np.random.rand(2,2)
    S2 = np.random.rand(2,2)
    dist_01 = np.random.multivariate_normal(m1, S1.T @ S1, n)
    dist_02 = np.random.multivariate_normal(m2, S2.T @ S2, n)
    X = np.concatenate((np.ones(2*n).reshape(-1,1),
                        np.concatenate((dist_01,dist_02))),axis=1)
    y = np.concatenate((np.ones(n), -1*np.ones(n))).reshape(-1,1)
    shuffle_idx = np.random.choice(2*n,size=2*n,replace=False)
    X = X[shuffle_idx]
    y = y[shuffle_idx]
    return X, y

  def plot_perceptron(ax, X, y, w):
    pos_points = X[np.where(y==1)[0]]
    neg_points = X[np.where(y==-1)[0]]
    ax.scatter(pos_points[:,1],pos_points[:,2],color='blue')
    ax.scatter(neg_points[:,1],neg_points[:,2],color='red')
    xx = np.linspace(-6,6)
    yy = -w[0]/w[2] - w[1]/w[2] * xx
    ax.plot(xx,yy,color='orange')

    ratio = (w[2]/w[1] + w[1]/w[2])
    xpt = (-1*w[0] / w[2]) * 1/ratio
    ypt = (-1*w[0] / w[1]) * 1/ratio

    ax.arrow(xpt,ypt,w[1],w[2],head_width=0.2, color='orange')
    ax.axis('equal')

  def train_perceptron_for_vis(X,y,max_iter=100):
    np.random.seed(69)
    w = np.random.random(3)
    ctr = 0
    for _ in range(max_iter):
      yXw = (y*X)@w.T
      mistake_idxs = np.where(yXw <= 0)[0]
      if mistake_idxs.size > 0:
        ctr += 1
        i = np.random.choice(mistake_idxs)
        w = w + y[i] * X[i]

        fig,ax = plt.subplots()
        plot_perceptron(ax,X,y,w)
        plt.show()
        print(f"Iteration {ctr}: w = {w}")

    fig,ax = plt.subplots()
    plot_perceptron(ax,X,y,w)
    plt.show()
    print(f"Iteration {ctr}: w = {w}")
    return

  X,y=generate_data(n=20,means=[[-1,-1],[1,2]],seed=204)
  train_perceptron_for_vis(X,y)

Credit: Omar Al Ghattas.

Iteration 1: w = [-0.70375084  1.128485   -0.09833516]
Iteration 2: w = [ 0.29624916  1.49381456 -0.89032827]
Iteration 3: w = [ 1.29624916 -0.03485745 -1.90008542]
Iteration 4: w = [ 0.29624916  0.6849623  -2.38995155]
Iteration 5: w = [ 1.29624916  0.94561468 -1.81027165]
Iteration 6: w = [ 0.29624916  1.26503197 -2.25885934]
Iteration 7: w = [ 1.29624916 -1.14315819 -3.20003627]
Iteration 8: w = [ 2.29624916 -0.88250582 -2.62035637]
Iteration 9: w = [ 1.29624916 -0.16268606 -3.1102225 ]
Iteration 10: w = [ 2.29624916  0.09796632 -2.5305426 ]
Iteration 11: w = [ 1.29624916  0.81778608 -3.02040872]
Iteration 12: w = [ 2.29624916  1.07843845 -2.44072883]
Iteration 13: w = [ 1.29624916  1.79825821 -2.93059495]
Iteration 14: w = [ 2.29624916 -0.60993195 -3.87177188]
Iteration 15: w = [ 1.29624916 -0.01559403 -4.51976735]
Iteration 16: w = [ 2.29624916  0.24505834 -3.94008746]
Iteration 17: w = [ 1.29624916  0.9648781  -4.42995358]
Iteration 18: w = [ 2.29624916  1.22553048 -3.85027368]
Iteration 19: w = [ 1.29624916  1.54494777 -4.29886137]
Iteration 20: w = [ 2.29624916  1.80560014 -3.71918147]
Iteration 21: w = [ 1.29624916  2.12501743 -4.16776915]
Iteration 22: w = [ 2.29624916  2.38566981 -3.58808926]
Iteration 23: w = [ 3.29624916 -0.02252035 -4.52926619]
Iteration 24: w = [ 2.29624916  0.29689694 -4.97785387]
Iteration 25: w = [ 3.29624916  0.55754931 -4.39817398]
Iteration 26: w = [ 2.29624916  1.27736907 -4.8880401 ]
Iteration 27: w = [ 3.29624916  1.53802145 -4.30836021]
Iteration 28: w = [ 2.29624916  1.85743874 -4.75694789]
Iteration 28: w = [ 2.29624916  1.85743874 -4.75694789]

Conclusion

Clearly this is an entertaining and simple binary classifier that just works. But beyond historical homage this technique does not really flourish in our present-day world of Transformers, CNN's and Stable Diffusion models. As such this was merely a starting point for our adventure. Next in the series we will see what surrenders to the MLP (multi-layered perceptron); we will also learn how to consistently find the best linear decision boundary with SVM's (Support Vector Machines) and then extend this by kernelising applying the linear SVM algorithm to even non-linear data!

Footnotes

or adjusting your hyperplane

we will check the correctness in the next section.

an inevitable future post on the optimum and kernelisable SVM (support vector machine) is imminent - stay tuned.