MNIST

History

Abstract

The MNIST dataset (Modified National Institute of Standards and Technology) has been very influential in machine learning and computer vision. It is an accessible and popular dataset that has been used as a benchmark for machine learning models since its inception in 1998. Historically, it advanced the development of OCR (Optical Character Recognition) and assisted in the emergence of neural networks.

Origins

The story of MNIST begins with the NIST dataset, developed by the United States National Institute of Standards and Technology in the late 1980s. The original dataset was created to facilitate research in OCR systems, which were becoming increasingly relevant for automating tasks like check processing and mail sorting. NIST's dataset consisted of tens of thousands of handwritten digits collected from various sources, including Census Bureau employees and high school students.

Limitations

While the NIST dataset was groundbreaking, it had significant limitations. Its training and testing subsets were drawn from different populations (Census employees for training and high school students for testing), making it less ideal for machine learning research. The inconsistency led to concerns about its generalizability for broader applications.

Transformation into MNIST

In 1998, Yann LeCun, Corinna Cortes, and Christopher J.C. Burges modified the NIST dataset to address these issues, creating what is now known as MNIST. This new dataset was meticulously curated to ensure consistency and usability. The handwritten digits were size-normalized to 28x28 pixel grayscale images and centered within the frame. Additionally, the team restructured the dataset to provide a balanced and homogeneous split between training and testing subsets, containing 60,000 and 10,000 examples, respectively.

The goal of the MNIST dataset was to provide a simple yet challenging benchmark for classification algorithms. It included clear definitions for training and testing splits, making it a reliable standard for evaluating model performance.

Classical Relevance

MNIST quickly became the "Hello World" of machine learning. Its accessibility, small size, and well-defined structure make it an ideal starting point for researchers and practitioners to experiment with algorithms.

"drosophila of machine learning" —Geoffrey Hinton

The dataset's prominence coincided with the resurgence of neural networks in the late 1990s and early 2000s. Yann LeCun's work on Convolutional Neural Networks (CNNs), which achieved state-of-the-art results on MNIST, demonstrated the potential of deep learning in image recognition. Over the years, MNIST has inspired countless advancements, including dropout regularization, optimization techniques like Adam, and new architectures.

Relevance

While MNIST is no longer the pinnacle of challenging datasets, it remains a staple in education and introductory machine learning courses. Researchers have since moved on to more complex datasets like CIFAR-10, ImageNet, and the Kuzushiji-MNIST (a derivative focusing on Japanese characters), but MNIST's legacy persists as a foundational dataset that accelerated the growth of modern AI.

The MNIST dataset represents more than just a collection of images—it embodies the journey of machine learning from its early days to its current prominence. Its simplicity, historical significance, and enduring utility make it a cornerstone of the field.

Solving MNIST

I know I said I wouldn't do so here on the tag page, but this analysis would be repeated numerous times everywhere else anyway.

Pulling the dataset

  from sklearn.datasets import fetch_openml
  mnist = fetch_openml('mnist_784', as_frame=False, parser='auto')
  # fetch_openml returns a pandas DataFrame by default; as_frame=False gives plain numpy arrays
  X, y = mnist.data, mnist.target
  print(X.shape)
  print(y.shape)
(70000, 784)
(70000,)

Interpreting the dataset

We have 70,000 labelled examples, hence we are doing supervised learning.

Our input data has 70,000 rows and 784 columns. Each row represents a distinct example, and each column value is a grayscale pixel intensity between 0 and 255.

Note that each 2D 28x28 image has been flattened into a flat 784-element array. We could also work with 3D tensors if we wanted to.
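If we did want the 3D view, a single reshape suffices. A minimal sketch (the variable name images is my own):

  images = X.reshape(-1, 28, 28)  # stack of 70,000 28x28 images
  print(images.shape)             # (70000, 28, 28)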

  print(X[0]) # a flat 784-element array
  reshaped = X[0].reshape(28,28)

  import matplotlib.pyplot as plt
  plt.imshow(reshaped, cmap='gray')
  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   3  18  18  18 126 136 175  26 166 255
   247 127   0   0   0   0   0   0   0   0   0   0   0   0  30  36  94 154
   170 253 253 253 253 253 225 172 253 242 195  64   0   0   0   0   0   0
     0   0   0   0   0  49 238 253 253 253 253 253 253 253 253 251  93  82
    82  56  39   0   0   0   0   0   0   0   0   0   0   0   0  18 219 253
   253 253 253 253 198 182 247 241   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0  80 156 107 253 253 205  11   0  43 154
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0  14   1 154 253  90   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0 139 253 190   2   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0  11 190 253  70   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  35 241
   225 160 108   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0  81 240 253 253 119  25   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0  45 186 253 253 150  27   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0  16  93 252 253 187
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0 249 253 249  64   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0  46 130 183 253
   253 207   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0  39 148 229 253 253 253 250 182   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0  24 114 221 253 253 253
   253 201  78   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0  23  66 213 253 253 253 253 198  81   2   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0  18 171 219 253 253 253 253 195
    80   9   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    55 172 226 253 253 253 253 244 133  11   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0 136 253 253 253 212 135 132  16
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0]
<matplotlib.image.AxesImage at 0x16d91eed0>

[Figure: mnist-5.png — the first example rendered as a 28x28 grayscale image]

Now clearly, this looks like a 5. Correspondingly, its label is

  print(y[0])
5

as expected.

Splitting the data

  X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
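The OpenML copy keeps the images in their original order, so slicing at 60,000 reproduces the official train/test split. If a randomized split were preferred instead, a quick sketch with scikit-learn's train_test_split (variable names are mine; stratify keeps the digit proportions equal in both halves):

  from sklearn.model_selection import train_test_split
  # alternative: a shuffled, class-stratified 60k/10k split
  X_tr, X_te, y_tr, y_te = train_test_split(
      X, y, test_size=10000, stratify=y, random_state=42)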

Training

I have a multitude of options:

|->binary classifiers
| |-> svc (support vector classifier)
| |-> sgd (stochastic gradient descent classifier)
| |-> random forest
|->multinomial classifiers
| |-> logistic regression
| |-> random forests
| |-> gaussian nb

Binary Classifier

If I choose a binary classifier I either do:

OvR (One vs. Rest)

Make 10 classifiers; a 0-detector, a 1-detector, …, a 9-detector. Then output the label with the highest score.

AKA OvA (One vs. All)

OR I do:

OvO (One vs. One)

Since this is still binary classification, you compare every pair of classes: 0 and 1, 0 and 2, 0 and 3, …, 8 and 9. This gives \(N\times(N-1)/2\) comparisons.

You can find the proof for this in my 23rd Bday Problems Solution Set, Q2. Here \(N = 10\), so 45 binary classifiers would need to be trained.

To output a decision, you pick the class whose classifier won the most "duels".

An advantage of this method, however, is that each classifier only needs to be trained on the subset of the data containing its two labels!
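scikit-learn can wrap any binary estimator in either strategy for you. A quick sketch of both, assuming the split above and using SGDClassifier as the underlying binary model (a smaller slice keeps the fit fast; variable names are mine):

  from sklearn.linear_model import SGDClassifier
  from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

  ovr = OneVsRestClassifier(SGDClassifier(random_state=42))  # 10 detectors
  ovo = OneVsOneClassifier(SGDClassifier(random_state=42))   # 45 pairwise duels

  ovr.fit(X_train[:10000], y_train[:10000])
  ovo.fit(X_train[:10000], y_train[:10000])
  print(len(ovr.estimators_), len(ovo.estimators_))  # 10 45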

Multinomial

We shall opt for the Logistic Regression classifier here for the sake of ease:

  from sklearn.linear_model import LogisticRegression
  sm_mod = LogisticRegression(multi_class='multinomial',
				penalty='l2',
				C=50,
				solver='sag',
				tol=.001,
				max_iter=1000
				).fit(X_train, y_train)
  from sklearn.metrics import accuracy_score
  from sklearn.metrics import confusion_matrix
  print(f'Train Accuracy: {accuracy_score(sm_mod.predict(X_train), y_train)}')
  print(f'Test Accuracy: {accuracy_score(sm_mod.predict(X_test), y_test)}')
  print("Confusion Matrix: \n"+str(confusion_matrix(y_test, sm_mod.predict(X_test))))
  /opt/anaconda3/envs/metal/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1247: FutureWarning: 'multi_class' was deprecated in version 1.5 and will be removed in 1.7. From then on, it will always use 'multinomial'. Leave it to its default value to avoid this warning.
    warnings.warn(
  Train Accuracy: 0.941
  Test Accuracy: 0.9231
  Confusion Matrix: 
  [[ 958    0    1    4    1    5    5    2    4    0]
   [   0 1112    8    2    0    1    3    1    8    0]
   [   4   11  920   18   11    5   12    9   39    3]
   [   3    2   18  924    2   22    3   10   20    6]
   [   2    3    5    4  916    0   10    5   10   27]
   [  11    5    3   38   11  763   15    7   34    5]
   [  10    3    9    2    7   17  908    1    1    0]
   [   3    7   23    8    6    1    0  945    2   33]
   [   7   13    5   23    6   24    7   13  864   12]
   [   8    6    1    9   23    6    0   23   12  921]]
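Each row of the matrix is a true digit and each column a predicted digit, so the diagonal holds the correct predictions. A quick sketch of reading it, assuming the fitted sm_mod and the split above (variable names are mine):

  import numpy as np
  cm = confusion_matrix(y_test, sm_mod.predict(X_test))
  rates = cm / cm.sum(axis=1, keepdims=True)  # row-normalize to per-class rates
  np.fill_diagonal(rates, 0)                  # ignore the correct predictions
  true_d, pred_d = np.unravel_index(rates.argmax(), rates.shape)
  print(f'Most common confusion: true {true_d} predicted as {pred_d}')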

Below you will find other flavours of MNIST classification problems as well as other fitted models. I particularly enjoyed my MLP implementation, which ascends the staircase from XOR.

MNIST and FMNIST using KNN

Read more >

Multilayer Perceptron

We have seen what can be learned by the perceptron algorithm — namely, linear decision boundaries for binary classification problems.

It may also be of interest to know that the perceptron algorithm can also be used for regression with the simple modification of not applying an activation function (i.e. the sigmoid). I refer the interested reader to open another tab.

We begin with the punchline:

XOR

[Figure: xor.png — the four XOR points, not linearly separable in \(\mathbb{R}^2\)]

Now clearly, no ruler, finger, or other straight-lined object placed on the above figure will let you separate the blue (true) circles from the red (false) ones. This was also the essence of Marvin Minsky and Seymour Papert's argument against further development of the perceptron in their 1969 book Perceptrons. However, with the benefit of hindsight, we shall not retire so quickly; instead, we add another layer of neurons:
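To make that concrete, here is a minimal sketch (hand-picked weights, step activations, all names my own) of a two-layer network that computes XOR:

  import numpy as np

  def step(z):
      return (z > 0).astype(int)

  X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
  # hidden layer: one neuron computes OR (bias -0.5), the other AND (bias -1.5)
  h = step(X_xor @ np.array([[1., 1.], [1., 1.]]) + np.array([-0.5, -1.5]))
  # output layer: OR minus AND, i.e. XOR
  out = step(h @ np.array([1, -1]) - 0.5)
  print(out)  # [0 1 1 0]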

Read more >