MNIST
History
Abstract
The MNIST dataset (Modified National Institute of Standards and Technology) has been very influential in machine learning and computer vision. It is a simple, popular dataset that has been used since its inception in 1998 as a benchmark for machine learning models. Historically it advanced the development of OCR (Optical Character Recognition) and assisted in the resurgence of neural networks.
Origins
The story of MNIST begins with the NIST dataset, developed by the United States National Institute of Standards and Technology in the late 1980s. The original dataset was created to facilitate research in OCR systems, which were becoming increasingly relevant for automating tasks like check processing and mail sorting. NIST's dataset consisted of tens of thousands of handwritten digits collected from various sources, including Census Bureau employees and high school students.
Limitations
While the NIST dataset was groundbreaking, it had significant limitations. Its training and testing subsets were drawn from different populations (Census Bureau employees for training and high school students for testing), making it less suitable for machine learning research. This inconsistency raised concerns about how well models evaluated on it would generalize.
Transformation into MNIST
In 1998, Yann LeCun, Corinna Cortes, and Christopher J.C. Burges modified the NIST dataset to address these issues, creating what is now known as MNIST. This new dataset was meticulously curated to ensure consistency and usability. The handwritten digits were size-normalized to 28x28 pixel grayscale images and centered within the frame. Additionally, the team restructured the dataset to provide a balanced and homogeneous split between training and testing subsets, containing 60,000 and 10,000 examples, respectively.
The goal of the MNIST dataset was to provide a simple yet challenging benchmark for classification algorithms. It included clear definitions for training and testing splits, making it a reliable standard for evaluating model performance.
Classical Relevance
MNIST quickly became the "Hello World" of machine learning. Its accessibility, small size, and well-defined structure make it an ideal starting point for researchers and practitioners to experiment with algorithms.
"drosophila of machine learning" —Geoffrey Hinton
The dataset's prominence coincided with the resurgence of neural networks in the late 1990s and early 2000s. Yann LeCun's work on Convolutional Neural Networks (CNNs), which achieved state-of-the-art results on MNIST, demonstrated the potential of deep learning in image recognition. Over the years, MNIST has served as a proving ground for countless advancements, including dropout regularization, optimizers like Adam, and new architectures.
Modern Relevance
While MNIST is no longer the pinnacle of challenging datasets, it remains a staple in education and introductory machine learning courses. Researchers have since moved on to more complex datasets like CIFAR-10, ImageNet, and Kuzushiji-MNIST (a derivative focusing on cursive Japanese characters), but MNIST's legacy persists as a foundational dataset that accelerated the growth of modern AI.
The MNIST dataset represents more than just a collection of images—it embodies the journey of machine learning from its early days to its current prominence. Its simplicity, historical significance, and enduring utility make it a cornerstone of the field.
Solving MNIST
I know I said I wouldn't do so here on the tag page, but this analysis would be repeated numerous times everywhere else anyway.
Pulling the dataset
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays; by default fetch_openml would
# hand the data back as a pandas DataFrame
mnist = fetch_openml('mnist_784', as_frame=False, parser='auto')
X, y = mnist.data, mnist.target
print(X.shape)
print(y.shape)
(70000, 784)
(70000,)
Interpreting the dataset
We have 70,000 labelled examples, hence we are doing supervised learning.
Our input data has 70,000 rows and 784 columns. Each row represents a distinct example, and each column holds a grayscale value between 0 and 255.
Note that each 2D 28x28 image has been flattened into a one-dimensional array of 784 values. We could also work with 3D tensors if we wanted to.
import matplotlib.pyplot as plt

print(X[0])  # flat array of 784 grayscale values
reshaped = X[0].reshape(28, 28)  # X.reshape(-1, 28, 28) would recover the full 3D stack
plt.imshow(reshaped, cmap='gray')
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 18 18 18 126 136 175 26 166 255 247 127 0 0 0 0 0 0 0 0 0 0 0 0 30 36 94 154 170 253 253 253 253 253 225 172 253 242 195 64 0 0 0 0 0 0 0 0 0 0 0 49 238 253 253 253 253 253 253 253 253 251 93 82 82 56 39 0 0 0 0 0 0 0 0 0 0 0 0 18 219 253 253 253 253 253 198 182 247 241 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 80 156 107 253 253 205 11 0 43 154 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 1 154 253 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 139 253 190 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 190 253 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 35 241 225 160 108 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81 240 253 253 119 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 45 186 253 253 150 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 93 252 253 187 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 249 253 249 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 46 130 183 253 253 207 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 148 229 253 253 253 250 182 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 114 221 253 253 253 253 201 78 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 66 213 253 253 253 253 198 81 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 171 219 253 253 253 253 195 80 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 55 172 226 253 253 253 253 244 133 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 136 253 253 253 212 135 132 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[Figure: plt.imshow rendering of the first training example, a handwritten 5]
Now clearly, this looks like a 5. Correspondingly, its label is
print(y[0])
5
as expected.
Splitting the data
# MNIST's canonical split: first 60,000 examples for training, last 10,000 for testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Training
I have a multitude of options:
|-> binary classifiers
|    |-> SVC (support vector classifier)
|    |-> SGD (stochastic gradient descent classifier)
|    |-> random forest
|-> multinomial classifiers
|    |-> logistic regression
|    |-> random forest
|    |-> Gaussian naive Bayes
Binary Classifier
If I choose a binary classifier, I either do:
OvR (One vs. Rest)
Make 10 classifiers: a 0-detector, a 1-detector, …, a 9-detector. Then output the label whose detector produces the highest score.
AKA OvA (One vs. All)
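For concreteness, here is a minimal sketch of OvR via scikit-learn's OneVsRestClassifier; the choice of SGDClassifier as the base estimator is my own, and I don't actually run this here:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

# One binary detector per digit; the predicted label is the class
# whose detector reports the highest decision score.
ovr = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr.fit(X_train, y_train)
print(len(ovr.estimators_))  # 10 underlying binary classifiers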
OR I do:
OvO (One vs. One)
Since this is still binary classification and you want to compare each pair of classes: 0 and 1, 0 and 2, 0 and 3, …, 8 and 9, you will have \(N\times(N-1)/2\) comparisons.
You can find the proof for this in my 23rd Bday Problems Solution Set, Q2. Here \(N = 10\), thus \(10 \times 9 / 2 = 45\) binary classifiers would need to be trained.
To output a decision, you would output the class whose classifier won the most "duels".
An advantage of this method, however, is that each classifier only needs to be trained on the subset of the data that contains its two labels!
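The same sketch for OvO, again assuming SGDClassifier as the base estimator (my choice, not run here); scikit-learn's OneVsOneClassifier handles the pairing and the "duel" counting automatically:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import SGDClassifier

# Trains N*(N-1)/2 = 45 binary classifiers, one per pair of digits;
# each sees only the training rows belonging to its two classes.
ovo = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo.fit(X_train, y_train)
print(len(ovo.estimators_))  # 45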
Multinomial
We shall opt for the Logistic Regression classifier here for the sake of ease:
from sklearn.linear_model import LogisticRegression

# multi_class='multinomial' is deprecated in scikit-learn >= 1.5 (it is
# now the default behaviour), hence the FutureWarning in the output below
sm_mod = LogisticRegression(multi_class='multinomial',
                            penalty='l2',
                            C=50,
                            solver='sag',
                            tol=.001,
                            max_iter=1000
                            ).fit(X_train, y_train)
from sklearn.metrics import accuracy_score, confusion_matrix

print(f'Train Accuracy: {accuracy_score(y_train, sm_mod.predict(X_train))}')
print(f'Test Accuracy: {accuracy_score(y_test, sm_mod.predict(X_test))}')
print("Confusion Matrix: \n" + str(confusion_matrix(y_test, sm_mod.predict(X_test))))
/opt/anaconda3/envs/metal/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1247: FutureWarning: 'multi_class' was deprecated in version 1.5 and will be removed in 1.7. From then on, it will always use 'multinomial'. Leave it to its default value to avoid this warning.
  warnings.warn(
Train Accuracy: 0.941
Test Accuracy: 0.9231
Confusion Matrix:
[[ 958    0    1    4    1    5    5    2    4    0]
 [   0 1112    8    2    0    1    3    1    8    0]
 [   4   11  920   18   11    5   12    9   39    3]
 [   3    2   18  924    2   22    3   10   20    6]
 [   2    3    5    4  916    0   10    5   10   27]
 [  11    5    3   38   11  763   15    7   34    5]
 [  10    3    9    2    7   17  908    1    1    0]
 [   3    7   23    8    6    1    0  945    2   33]
 [   7   13    5   23    6   24    7   13  864   12]
 [   8    6    1    9   23    6    0   23   12  921]]
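A quick way to read the matrix (a sketch reusing the fitted sm_mod from above) is to row-normalise it, so that entry (i, j) becomes the fraction of true digit i predicted as digit j; the 38 in row 5, column 3 then stands out as the most common confusion:

import numpy as np
from sklearn.metrics import confusion_matrix

# Row-normalise: off-diagonal hotspots are the most frequent
# confusions (here, 5 misread as 3).
cm = confusion_matrix(y_test, sm_mod.predict(X_test))
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_norm, 2))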
Below you will find other flavours of MNIST classification problems as well as other fitted models. I particularly enjoyed my MLP implementation, which ascends this staircase starting from the XOR problem.
We have seen what can be learned by the perceptron algorithm — namely, linear decision boundaries for binary classification problems.
It may also be of interest to know that the perceptron algorithm can be used for regression with the simple modification of not applying an activation function (e.g. the step function or sigmoid); the resulting linear unit is just linear regression. I refer the interested reader to open another tab.
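A minimal sketch of that claim on synthetic data (the toy numbers and names below are my own invention): drop the activation, run gradient descent on squared error, and the unit recovers the slope and intercept of a line.

import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=100)
y_toy = 3 * X_toy + 0.5 + rng.normal(0, 0.1, size=100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * X_toy + b  # identity in place of the activation
    w -= lr * ((pred - y_toy) * X_toy).mean()
    b -= lr * (pred - y_toy).mean()
print(w, b)  # should approach the true slope 3 and intercept 0.5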
We begin with the punchline:
XOR
[Figure: the XOR dataset; blue (true) and red (false) circles at the corners of the unit square]
Now clearly, no ruler, finger, or other straight-edged object positioned on the above figure will enable you to separate the blue (true) from the red (false) circles. This was also one of Marvin Minsky and Seymour Papert's arguments against further development of the perceptron in 1969. However, with the benefit of hindsight, we shall not retire so quickly; instead we add another layer of neurons:
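As a teaser of where that extra layer takes us, a sketch with scikit-learn's MLPClassifier (the layer size, activation, and seed are arbitrary choices of mine):

import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR: no single line separates the classes, but one hidden layer does.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    solver='lbfgs', random_state=1, max_iter=2000)
mlp.fit(X_xor, y_xor)
print(mlp.predict(X_xor))  # should recover [0 1 1 0]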