Deep Learning

Kuzushiji MNIST

This page is for finding a classifier on the KMNIST dataset. This dataset is more challenging than the original MNIST dataset that I have previously solved.

The details of the dataset can be found in the associated paper.

In short, since the reformation of the Japanese education in 1868, there became a standardisation of the kanji characters, and in the present day, most Japanese people cannot read the texts from 150 years ago.

Read more >

Multilayer Perceptron

We have seen what can be learned by the perceptron algorithm — namely, linear decision boundaries for binary classification problems.

It may also be of interest to know that the perceptron algorithm can also be used for regression with the simple modification of not applying an activation function (i.e. the sigmoid). I refer the interested reader to open another tab.

We begin with the punchline:

XOR

xor.png
Not linearly separabel in \(\mathbb{R}^2\)

Now clearly, taking a ruler, your finger or positioning any straight-lined object on the above figure will not enable you to separate the blue from the red circles. This was also one of Marvin Minsky's arguments against further development of the Perceptron in 1963. However, with the benefit of hindsight, we shall not retire so quickly, instead we add another layer of the neurons:

Read more >

NanoGPT - Min with Teeth

Andrej Karpathy Video

Code

Pulling the dataset we will be working on:

  curl https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -o input.txt

Reading it into python

  with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

Data inspection

print("length of dataset in characters: ", len(text))
print("length of data: ", len(data))
length of dataset in characters:  1115394
length of data:  1115394
  print(text[:1000])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.
  chars = sorted(list(set(text)))
  vocab_size = len(chars)
  print(''.join(chars))
  print(vocab_size)

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65

Tokeniser

  stoi = { ch:i for i,ch in enumerate(chars) }
  itos = { i:ch for i,ch in enumerate(chars) }
  encode = lambda s: [stoi[c] for c in s]
  # defines function taking in string, outputs list of ints
  decode = lambda l: ''.join([itos[i] for i in l])
  # input: list of integers, outputs string

  print(encode("hello world"))
  print(decode(encode("hello world")))
[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world
  import torch
  data = torch.tensor(encode(text), dtype=torch.long)
  print(data.shape, data.dtype)
  print(data[:1000])
torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57,  1, 47, 57,  1, 41, 46, 47, 43, 44,  1, 43, 52, 43, 51, 63,
         1, 58, 53,  1, 58, 46, 43,  1, 54, 43, 53, 54, 50, 43,  8,  0,  0, 13,
        50, 50, 10,  0, 35, 43,  1, 49, 52, 53, 61,  5, 58,  6,  1, 61, 43,  1,
        49, 52, 53, 61,  5, 58,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47, 58,
        47, 64, 43, 52, 10,  0, 24, 43, 58,  1, 59, 57,  1, 49, 47, 50, 50,  1,
        46, 47, 51,  6,  1, 39, 52, 42,  1, 61, 43,  5, 50, 50,  1, 46, 39, 60,
        43,  1, 41, 53, 56, 52,  1, 39, 58,  1, 53, 59, 56,  1, 53, 61, 52,  1,
        54, 56, 47, 41, 43,  8,  0, 21, 57,  5, 58,  1, 39,  1, 60, 43, 56, 42,
        47, 41, 58, 12,  0,  0, 13, 50, 50, 10,  0, 26, 53,  1, 51, 53, 56, 43,
         1, 58, 39, 50, 49, 47, 52, 45,  1, 53, 52,  5, 58, 11,  1, 50, 43, 58,
         1, 47, 58,  1, 40, 43,  1, 42, 53, 52, 43, 10,  1, 39, 61, 39, 63,  6,
         1, 39, 61, 39, 63,  2,  0,  0, 31, 43, 41, 53, 52, 42,  1, 15, 47, 58,
        47, 64, 43, 52, 10,  0, 27, 52, 43,  1, 61, 53, 56, 42,  6,  1, 45, 53,
        53, 42,  1, 41, 47, 58, 47, 64, 43, 52, 57,  8,  0,  0, 18, 47, 56, 57,
        58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 35, 43,  1, 39, 56, 43,  1,
        39, 41, 41, 53, 59, 52, 58, 43, 42,  1, 54, 53, 53, 56,  1, 41, 47, 58,
        47, 64, 43, 52, 57,  6,  1, 58, 46, 43,  1, 54, 39, 58, 56, 47, 41, 47,
        39, 52, 57,  1, 45, 53, 53, 42,  8,  0, 35, 46, 39, 58,  1, 39, 59, 58,
        46, 53, 56, 47, 58, 63,  1, 57, 59, 56, 44, 43, 47, 58, 57,  1, 53, 52,
         1, 61, 53, 59, 50, 42,  1, 56, 43, 50, 47, 43, 60, 43,  1, 59, 57, 10,
         1, 47, 44,  1, 58, 46, 43, 63,  0, 61, 53, 59, 50, 42,  1, 63, 47, 43,
        50, 42,  1, 59, 57,  1, 40, 59, 58,  1, 58, 46, 43,  1, 57, 59, 54, 43,
        56, 44, 50, 59, 47, 58, 63,  6,  1, 61, 46, 47, 50, 43,  1, 47, 58,  1,
        61, 43, 56, 43,  0, 61, 46, 53, 50, 43, 57, 53, 51, 43,  6,  1, 61, 43,
         1, 51, 47, 45, 46, 58,  1, 45, 59, 43, 57, 57,  1, 58, 46, 43, 63,  1,
        56, 43, 50, 47, 43, 60, 43, 42,  1, 59, 57,  1, 46, 59, 51, 39, 52, 43,
        50, 63, 11,  0, 40, 59, 58,  1, 58, 46, 43, 63,  1, 58, 46, 47, 52, 49,
         1, 61, 43,  1, 39, 56, 43,  1, 58, 53, 53,  1, 42, 43, 39, 56, 10,  1,
        58, 46, 43,  1, 50, 43, 39, 52, 52, 43, 57, 57,  1, 58, 46, 39, 58,  0,
        39, 44, 44, 50, 47, 41, 58, 57,  1, 59, 57,  6,  1, 58, 46, 43,  1, 53,
        40, 48, 43, 41, 58,  1, 53, 44,  1, 53, 59, 56,  1, 51, 47, 57, 43, 56,
        63,  6,  1, 47, 57,  1, 39, 57,  1, 39, 52,  0, 47, 52, 60, 43, 52, 58,
        53, 56, 63,  1, 58, 53,  1, 54, 39, 56, 58, 47, 41, 59, 50, 39, 56, 47,
        57, 43,  1, 58, 46, 43, 47, 56,  1, 39, 40, 59, 52, 42, 39, 52, 41, 43,
        11,  1, 53, 59, 56,  0, 57, 59, 44, 44, 43, 56, 39, 52, 41, 43,  1, 47,
        57,  1, 39,  1, 45, 39, 47, 52,  1, 58, 53,  1, 58, 46, 43, 51,  1, 24,
        43, 58,  1, 59, 57,  1, 56, 43, 60, 43, 52, 45, 43,  1, 58, 46, 47, 57,
         1, 61, 47, 58, 46,  0, 53, 59, 56,  1, 54, 47, 49, 43, 57,  6,  1, 43,
        56, 43,  1, 61, 43,  1, 40, 43, 41, 53, 51, 43,  1, 56, 39, 49, 43, 57,
        10,  1, 44, 53, 56,  1, 58, 46, 43,  1, 45, 53, 42, 57,  1, 49, 52, 53,
        61,  1, 21,  0, 57, 54, 43, 39, 49,  1, 58, 46, 47, 57,  1, 47, 52,  1,
        46, 59, 52, 45, 43, 56,  1, 44, 53, 56,  1, 40, 56, 43, 39, 42,  6,  1,
        52, 53, 58,  1, 47, 52,  1, 58, 46, 47, 56, 57, 58,  1, 44, 53, 56,  1,
        56, 43, 60, 43, 52, 45, 43,  8,  0,  0])
  n = int(0.9*len(data))
  train_data = data[:n]
  val_data = data[n:]

Understanding the context influence of n+1th token

  block_size = 8
  print(train_data[:block_size])
  x = train_data[:block_size]
  y = train_data[1:block_size+1]
  for t in range(block_size):
      context = x[:t+1]
      target = y[t]
      print(f"at input {context}\n" +
	    f"target {target}")
tensor([18, 47, 56, 57, 58,  1, 15, 47])
at input tensor([18])
target 47
at input tensor([18, 47])
target 56
at input tensor([18, 47, 56])
target 57
at input tensor([18, 47, 56, 57])
target 58
at input tensor([18, 47, 56, 57, 58])
target 1
at input tensor([18, 47, 56, 57, 58,  1])
target 15
at input tensor([18, 47, 56, 57, 58,  1, 15])
target 47
at input tensor([18, 47, 56, 57, 58,  1, 15, 47])
target 58

Note that within the block_size of 8, there are 8 total examples.

Read more >

Optical Character Recognition

OCR

This was one of the first times I fell in love with Machine Learning, without knowing that the magic came from Machine Learning methods.

I simply had a `pdf` file with handwritten or scanned text, and desperately wanted to find something within the document without having to do it manually.

I google online for an OCR; upload my file; download it back again, and hey presto — such magical, accurate results!

Read more >

Perceptron

This is tomorrow's work, but for now I just want to test some codeblocks:

pwd
pwd
.
├── LICENSE
├── README.md
├── archetypes
├── assets
├── backup
├── config
├── content
├── data
├── eslint.config.mjs
├── go.mod
├── go.sum
├── hugo-mod-json-resume
├── hugo-theme-gruvbox-inkscape.svg
├── hugo-theme-gruvbox.svg
├── hugo_stats.json
├── i18n
├── layouts
├── node_modules
├── package-lock.json
├── package.hugo.json
├── package.json
├── postcss.config.js
├── public
├── resources
├── static
└── theme.toml

14 directories, 13 files
./
├── _index.org
└── _index.org~
1 directory, 2 files
  #include <stdio.h>

  int main() {
    char a[5] = "hello";
    printf("%s world!\n", a);
    return 0;
  }
  
    cd /usr/local/etc
    cp php.ini php.ini.bak
    vi php.ini
  
  
    pwd
    /usr/home/chris/bin
    ls -la
    total 2
    drwxr-xr-x   2 chris  chris     11 Jan 10 16:48 .
    drwxr--r-x  45 chris  chris     92 Feb 14 11:10 ..
    -rwxr-xr-x   1 chris  chris    444 Aug 25  2013 backup
    -rwxr-xr-x   1 chris  chris    642 Jan 17 14:42 deploy
  
  
    Hello,
    World!

    Foo
    Bar
  

Perceptron

Origins

The perceptron learning algorithm is the most simple algorithm we have for Binary Classification.

It was introduced by Frank Rosenblatt in his seminal paper: "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" in 1958. The history however dates back further to the theoretical foundations of Warren McCulloch and Walter Pitts in 1943 and their paper "A Logical Calculus of the Ideas Immanent in Nervous Activity". The interested reader may visit these links for annotations and the original pdfs.

Read more >