Hola!

As part of the Univ.ai KTF-1 course, I created this solution for one of the project questions.

Write a classifier that distinguishes the number 8 & number 3.

Gee. This might seem like a really simple question at first.

"Well, let's use scikit learn and all the other libraries we have and create a classifier in 3 lines."

But no. Let's use the Kudzu library - which I coded myself - where everything needed to run a neural network is implemented from scratch.

Credits to Rahul (& of course, Joel Grus) for wonderfully showing how simply a neural network can be coded.

Throughout this project, I make use of the Kudzu library to run the models.

Towards the end, you can also see a visualization of how the computer "splits" 3 & 8.

Some of the things I have altered in the Kudzu library:

  1. Added extra functions to the callback classes
  2. Added tqdm for visualizing progress
  3. Added a softmax activation and a softmax prime function

Preparing the Data

"What is MNIST anyway?"

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

It was created by "re-mixing" the samples from NIST's original datasets.

The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments.

Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

The MNIST database contains 60,000 training images and 10,000 testing images.

Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.

Source: Wikipedia
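The cells below assume that train_images, train_labels, test_images and test_labels have already been loaded. The loading cell isn't shown in this post; one common way to get them - an assumption on my part, not necessarily how the original cell does it - is via keras:

from tensorflow.keras.datasets import mnist  # assumption: loading MNIST via keras
import numpy as np
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()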

Let's take a look at what this data looks like.

#collapse
image_index = 7776 # You may select any index from 0 to 59,999
print(train_labels[image_index]) 
plt.imshow(train_images[image_index], cmap='Greys')
2
<matplotlib.image.AxesImage at 0x1894e46f6c8>

Filter data to get 3 and 8 out

Now, in order to create our classifier, we must filter the 3s and 8s out from the lot.

#filtering the 3 and 8 data - splitting it into test and train
train_filter = np.where((train_labels == 3 ) | (train_labels == 8))
test_filter = np.where((test_labels == 3) | (test_labels == 8))
X_train, y_train = train_images[train_filter], train_labels[train_filter]
X_test, y_test = test_images[test_filter], test_labels[test_filter]

We normalize the pixel values to the 0 to 1 range

X_train = X_train/255.
X_test = X_test/255.

And set up the labels as 1 (when the digit is 3) and 0 (when the digit is 8)

y_train = 1*(y_train==3)
y_test = 1*(y_test==3)
X_train.shape, X_test.shape
((11982, 28, 28), (1984, 28, 28))

We reshape the data to flatten the image pixels into a set of features or covariates:

X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)
X_train.shape, X_test.shape
((11982, 784), (1984, 784))
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
y_train.shape, y_test.shape 
((11982, 1), (1984, 1))

So how does this work?

In regression, we saw our model outputting numerical values.

But in the case of logistic regression, our model will instead output the probability that a given input belongs to a certain class.

This is based on the premise that the input space can in fact be separated into two 'regions' (in the case of 2 classes) by a linear boundary.

In the case of two dimensions, our linear boundary would be a line. For three dimensions, it would end up being a plane. For higher dimensions, we would call this boundary a 'hyperplane'.

$$ \begin{eqnarray} y = 1 & ,\,\, & {w}\cdot{x} + b \ge 0\\ y = 0 & ,\,\, & {w}\cdot{x} + b < 0 \end{eqnarray} $$

Consider the sigmoid function:

$$h(z) = \frac{1}{1 + e^{-z}}.$$

with the identification:

$$z = {w}\cdot{x} + b$$

At $z=0$ this function has the value 0.5.

If $z > 0$, $h > 0.5$ and as $z \to \infty$, $h \to 1$.

If $z < 0$, $h < 0.5$ and as $z \to -\infty$, $h \to 0$.
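A quick numerical check of these limiting values, using plain numpy (just to illustrate the formula, not Kudzu's Sigmoid class):

import numpy as np

def h(z):
    # the sigmoid defined above
    return 1.0 / (1.0 + np.exp(-z))

print(h(0.0), h(10.0), h(-10.0))  # 0.5, ~0.99995, ~0.0000454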

Through Maximum Likelihood estimation, we can then arrive at our log likelihood.

$$L = P(y \mid {x},{w}).$$

$$l = log L = log(P(y \mid {x},{w})).$$

Thus

$$\begin{eqnarray} l &=& log\left(\prod_{y_i \in \cal{D}} h({w}\cdot{x_i} + b)^{y_i} \left(1 - h({w}\cdot{x_i} + b) \right)^{(1-y_i)}\right)\\ &=& \sum_{y_i \in \cal{D}} log\left(h({w}\cdot{x_i} + b)^{y_i} \left(1 - h({w}\cdot{x_i} + b) \right)^{(1-y_i)}\right)\\ &=& \sum_{y_i \in \cal{D}} log\,h({w}\cdot{x_i} + b)^{y_i} + log\,\left(1 - h({w}\cdot{x_i} + b) \right)^{(1-y_i)}\\ &=& \sum_{y_i \in \cal{D}} \left ( y_i log(h({w}\cdot{x_i} + b)) + ( 1 - y_i) log(1 - h({w}\cdot{x_i} + b)) \right ) \end{eqnarray}$$

The negative of this log likelihood, henceforth abbreviated NLL, is also called the Binary Cross-Entropy loss.

$$ NLL = - \sum_{y_i \in \cal{D}} \left ( y_i log(h({w}\cdot{x_i} + b)) + ( 1 - y_i) log(1 - h({w}\cdot{x_i} + b)) \right )$$
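To make the NLL concrete, here is a minimal numpy version of this loss (a sketch of the idea only - Kudzu's own BCE class may well be implemented differently):

import numpy as np

def nll(y_true, probs, eps=1e-12):
    # binary cross-entropy / negative log likelihood, averaged over the dataset
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs))

print(nll(np.array([1, 0]), np.array([0.9, 0.1])))  # confident and correct -> small loss
print(nll(np.array([1, 0]), np.array([0.1, 0.9])))  # confident and wrong -> large loss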

Neural Network

For our neural network, we need to use this cost function in order to find the gradient and perform gradient descent.

Data is passed through a neural network's layers in a general sequence.

These layers mostly follow the order:

  1. Linear transformation or affine layer
  2. Non-linear transformation or activation layer
  3. Cost function calculation

This is known as forward propagation.

In the case of backward propagation, the cost function is used to calculate the gradient and find the direction in which to move towards the global minimum of the loss.

This is done by going in the reverse or backward phase - we now propagate the derivatives backward through the layer cake.

The weights are then updated by means of the derivatives calculated via the chain rule.

It is also important to note that a learning rate is initialized so as to control the "length" of the step towards the minimum.

If the step is too long, you might end up jumping to the next mountain.

If your step is too short, you might end up camping at night for three days, almost getting mauled by bears, before you reach the minima.
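To make the forward/backward cycle and the weight update concrete, here is a toy sketch in plain numpy - not Kudzu's code, just a single affine layer squeezed through a sigmoid and trained with the NLL above on made-up data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                              # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)   # toy labels

w, b = np.zeros((2, 1)), 0.0
lr = 0.1  # the learning rate: the "length" of each step

for epoch in range(100):
    # forward pass: affine -> sigmoid -> loss
    z = X @ w + b
    h = 1.0 / (1.0 + np.exp(-z))
    loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    # backward pass: the chain rule gives dLoss/dz = (h - y) / N for this sigmoid + NLL pair
    dz = (h - y) / len(X)
    dw, db = X.T @ dz, dz.sum()
    # gradient descent update
    w -= lr * dw
    b -= lr * db

print(round(loss, 4))  # the loss shrinks as the weights settle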

1. I use the following configuration (or similar) to set up the model for training.

class Config:
    pass
config = Config()
config.lr = 0.001
config.num_epochs = 200
config.bs = 50

Now construct a model which has the following layers:

  1. A first affine layer which has 784 inputs and does 100 affine transforms. These are followed by a Relu
  2. A second affine layer which has 100 inputs from the 100 activations of the past layer, and does 100 affine transforms. These are followed by a Relu
  3. A third affine layer which has 100 activations and does 2 affine transformations to create an embedding for visualization. There is no non-linearity here.
  4. A final "logistic regression" which has an affine transform from 2 inputs to 1 output, which is squeezed through a sigmoid.
train_data = Data(X_train, y_train)
loss = BCE()
opt = GD(config.lr)
sampler = Sampler(train_data, config.bs, shuffle=True)
train_dl = Dataloader(train_data, sampler)
layers = [Affine("first", 784, 100), Relu("relu"), Affine("second", 100, 100), Relu("relu"), Affine("third", 100, 2), Affine("output", 2, 1), Sigmoid("sigmoid")]
model = Model(layers)

2. Create a callback class

I further added code to the AccCallback class for the following functionality (a rough sketch of the idea follows the list):

  1. Initialized it to have accuracy arrays:
self.accuracies = []
self.test_accuracies = []
  2. Then, at the end of each epoch, calculated the probabilities and hence the predictions on both the training set and the test set, and printed these out once per epoch.

  3. Accumulated these in the above arrays. This requires keeping track of all four arrays: the training and test inputs and labels.

  4. Edited the Learner to pass these sets (in a data object) to the callback.

  5. Added tqdm functionality in the AccCallback class, in order to visually see the batch progress within each epoch.
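Kudzu's actual Callback interface isn't reproduced here; the sketch below only illustrates the bookkeeping idea (the class name, constructor arguments and the on_epoch_end hook are hypothetical, not the real AccCallback API):

import numpy as np

class AccuracyTracker:
    """Illustration only - not Kudzu's AccCallback. Accumulates accuracies once per epoch."""
    def __init__(self, model, X_train, y_train, X_test, y_test):
        self.model = model
        self.data = (X_train, y_train, X_test, y_test)
        self.accuracies = []        # training accuracy per epoch
        self.test_accuracies = []   # test accuracy per epoch

    def on_epoch_end(self, epoch):  # hypothetical hook name
        X_tr, y_tr, X_te, y_te = self.data
        train_acc = np.mean(1 * (self.model(X_tr) >= 0.5) == y_tr)
        test_acc = np.mean(1 * (self.model(X_te) >= 0.5) == y_te)
        self.accuracies.append(train_acc)
        self.test_accuracies.append(test_acc)
        print(f"epoch {epoch}: train acc {train_acc:.4f}, test acc {test_acc:.4f}")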

learner = Learner(loss, model, opt, config.num_epochs)
acc = AccCallback(learner, config.bs)
learner.set_callbacks([acc])

3. Train the model

Now that the training has completed, let's take a look at how our model performed.

#collapse
plt.figure(figsize=(14,7))
plt.plot(acc.accuracies, label="Training Accuracy", linewidth=3)
plt.plot(acc.test_accuracies, label="Validation Accuracy", linewidth=3)
plt.legend(loc='upper left')
plt.grid(which='both', alpha=0.95, linestyle="--")
plt.title("Training Accuracy vs. Validation Accuracy");

Not too bad!

Usually, as the epochs progress, if the training accuracy and validation accuracy differ widely and keep moving away from each other, it is a sign of overfitting.

In our case, there is a small "moving away" happening, but since it is not significant, we can call this a good performance.

The intuition behind comparing training accuracy and validation accuracy is that if your model ends up learning the training data too well, it won't be able to generalize to new data it sees. This is, in fact, what we call overfitting.

#collapse
plt.figure(figsize=(14,7))
plt.plot(acc.losses, linewidth=3, color="b")
plt.grid(which='both', alpha=0.95, linestyle="--")
plt.title("Losses over Epochs");

The graph above is the loss we see over the epochs. Clearly, the model is learning to approach the global minimum through gradient descent.

(although it is important to note that we may not have fully reached it)

#collapse
plt.figure(figsize=(14,7))
plt.plot(acc.batch_losses, color = "r")
plt.grid(which='both', alpha=0.95, linestyle="--")
plt.title("Batch Losses over Epochs");

4. Plot the results

Our accuracy is 99.2%. Nice!

In order to see this more clearly, in the embedding space (the inputs of the "logistic regression") I am now plotting the data points by running them forward through the network.

By color coding them with their actual class and plotting the probability contours (these will all be lines in the embedding space since, from here on, this is a logistic regression), we can better see the points stranded on the "wrong" side of the probability 1/2 line.

I have also plotted the predictions against the actual values showing where they are in the embedding space.

I have also tried a 3D plot to see the sigmoid shape of the logistic regression at work (X - first output from the affine layer; Y - second output from the affine layer; Z - the probability for the X, Y data).

#collapse
actual_inputs = np.concatenate((X_train, X_test))
actual_outputs = np.concatenate((y_train, y_test))
predicted_outputs =  1*(model(actual_inputs) >= 0.5)

#collapse
print("Your model accuracy is "+ str(np.mean(actual_outputs == predicted_outputs) * 100) +'.')
Your model accuracy is 99.10496921094085.

#collapse
#hide

#getting the last two layers for plotting
prob_layer = [Affine("output", 2, 1), Sigmoid("sigmoid")]
prob_model = Model(prob_layer)
prob_model.layers[0].params['w'] = model.layers[5].params['w']
prob_model.layers[0].params['b'] = model.layers[5].params['b']

#collapse
visualize_model_1 = Model(layers[:-2])
plot_testing = visualize_model_1(X_test)

#collapse
plt.figure(figsize=(14,7))
plt.scatter(plot_testing[:,0], plot_testing[:,1], c = y_test.ravel());
plt.title("Visualization of the two classes of data in 2D space");

#collapse
#this would be our probability layer (last two layers)
model_prob = Model(layers[-2:])

#collapse
xgrid = np.linspace(-4, 1, 100) 
ygrid = np.linspace(-7.5, 7.5, 100) 
xg, yg = np.meshgrid(xgrid, ygrid)
# xg and yg are now both 100x100 -> we need to convert them to single arrays

xg_interim = np.ravel(xg)
yg_interim = np.ravel(yg)

# xg_interim, yg_interim are now arrays of len 10000, now we will stack them and then transpose to get desired shape of n rows, 2 columns

X_interim = np.vstack((xg_interim, yg_interim)) 
X = X_interim.T

# We want a shape of n rows and 2 columns in order to be able to feed this to last affine
# This last affine takes only two columns, hence the above transformation
probability_contour = model_prob(X).reshape(100,100) 

#collapse
plt.figure(figsize=(14,7))
plt.scatter(plot_testing[:,0], plot_testing[:,1], c = y_test.ravel())
contours = plt.contour(xg,yg,probability_contour)
plt.clabel(contours, inline = True );
plt.title("Probability Contour that distinguishes one data class from the other");

#collapse
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions

# A grid over the embedding space, intended for an (unused, commented-out) contour of the probability layer
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = prob_model(grid)
#contour = ax.contourf(xx, yy, probs.reshape(xx.shape), 25, cmap="RdBu", vmin=0, vmax=1)

# The scatter points below were not defined in the visible code; a plausible
# reconstruction (my assumption): split the test-set embedding by class and push
# each part through the probability layer.
visualization_points_3 = plot_testing[y_test.ravel() == 1]   # embedding coordinates of the 3s (label 1)
visualization_points_8 = plot_testing[y_test.ravel() == 0]   # embedding coordinates of the 8s (label 0)
model_points_3 = prob_model(visualization_points_3).ravel()  # sigmoid outputs for the 3s
model_points_8 = prob_model(visualization_points_8).ravel()  # sigmoid outputs for the 8s

fig = plt.figure(figsize=(15, 10))
ax = plt.axes(projection='3d')
ax.scatter(model_points_3, visualization_points_3.T[0], visualization_points_3.T[1], linewidth=0, c='r')
ax.scatter(model_points_8, visualization_points_8.T[0], visualization_points_8.T[1], linewidth=0, c='b')
plt.title("3D overview of input data wrt its sigmoid value");
#ax.view_init(elev=90, azim=75)

Super cool!

Let's now build a simple Logistic Regression model and compare it with the Neural Network we constructed.

It would be pretty interesting to see the difference in outcomes.
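The setup cell for the logistic regression (which produces the model_logistic and acc2 used below) isn't shown in this post. A plausible sketch, reusing the same Kudzu pieces as before - my assumption of how it was wired up, not the original cell:

logistic_layers = [Affine("logistic", 784, 1), Sigmoid("sigmoid")]  # one affine straight into a sigmoid
model_logistic = Model(logistic_layers)
learner2 = Learner(BCE(), model_logistic, GD(config.lr), config.num_epochs)
acc2 = AccCallback(learner2, config.bs)
learner2.set_callbacks([acc2])
# ...followed by the same training step on the same Dataloader as the neural network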

#collapse
plt.figure(figsize=(14,7))
plt.plot(acc.test_accuracies, 'g-', label = "Val Accuracy - NN")
plt.plot(acc.accuracies, 'r-', label = "Training Accuracy - NN")
plt.plot(acc2.test_accuracies, 'b--', label="Val Accuracy - Logistic Reg")
plt.plot(acc2.accuracies, 'c--', label="Training Accuracy - Logistic Reg")
plt.ylim(0.8,1) ## for a more spread out view
plt.legend()
plt.title("Comparing NN accuracies & Logistic Regression Model accuracy");

#collapse
actual_inputs = np.concatenate((X_train, X_test))
actual_outputs = np.concatenate((y_train, y_test))
predicted_outputs =  1*(model_logistic(actual_inputs) >= 0.5)

print("Your model accuracy is "+ str(np.mean(actual_outputs == predicted_outputs) * 100) +'.')
Your model accuracy is 96.28383216382643.

...And the winner is our lovely Neural Network model!

Looking at this graph, the Neural Network clearly performed better than our simple logistic model.

Beauty of Neural Networks

Now that we've seen both these different types of models performing, you might be wondering - why and how do they do what they do?

Well, neural networks are absolutely brilliant in the way they work - and brilliance doesn't have to mean ever-increasing complexity. With the right mindset and understanding, you will come to appreciate the beauty in the simplicity of how they work.

As a matter of fact, the Universal Approximation Theorem, to me, is the theoretical foundation of why neural networks work.

It states that a neural network with one hidden layer containing a sufficient but finite number of neurons can approximate any continuous function to a reasonable accuracy, under certain conditions for activation functions (namely, that they must be sigmoid-like).

It was formulated in 1989 by George Cybenko only for sigmoid activations - however, Kurt Hornik later proved in 1991 that it applies to a much broader class of activation functions, not just sigmoids.

Hornik also came to realize that the architecture of the neural network, not the specific choice of activation function, was the key factor behind this approximation power - and this discovery was significant in sparking the exciting development of neural networks into the variety of applications we see today.

I believe that understanding it might be a good step towards understanding truly how these networks work.

Without delving too much into the details and mathematics of it, the key point of how the Universal Approximation theory works is the following -

Instead of trying to create one complex function that maps your inputs to your outputs, divide the function into smaller pieces and assign a neuron, performing simple linear manipulations, to represent each of these less complicated pieces.

One of my favorite teachers is Grant from 3Blue1Brown, and the one thing that stuck with me from his videos is that, given any complex problem, you should always try to break it down into its simplest form.

And this is what we are doing here!

One thing to note, however, is that the Universal Approximation theorem says nothing about generalizing beyond the data you trained on.

Wait, what?

Yes - neural networks are generally great approximators, but if you provide them with a value outside the range of their training inputs, they will most probably fail.

This is sort of similar to a truncated Taylor series approximation that models a sine wave well within a certain range but goes berserk outside of it.
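To make both points concrete, here is a small toy sketch (not part of the original project): a single hidden layer of tanh units fitted to sin(x) on a limited range. It approximates the function well inside that range, but stops tracking the sine once we leave it.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x_train = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y_target = np.sin(x_train)

# One hidden layer of 50 random tanh units; only the output weights are fitted
# (via least squares) to keep the sketch short.
W, b = rng.normal(size=(1, 50)), rng.normal(size=(1, 50))
H = np.tanh(x_train @ W + b)
w_out, *_ = np.linalg.lstsq(H, y_target, rcond=None)

x_all = np.linspace(-2 * np.pi, 2 * np.pi, 400).reshape(-1, 1)
y_pred = np.tanh(x_all @ W + b) @ w_out

plt.plot(x_all, np.sin(x_all), label="sin(x)")
plt.plot(x_all, y_pred, "--", label="one-hidden-layer fit")
plt.axvspan(-np.pi, np.pi, alpha=0.1, label="training range")
plt.legend();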

Ultimately, neural networks are really estimators that are able to solve multi-dimensional problems and this is quite impressive by itself.

I understand that there are a lot of perspectives and there's so much to explore! And that is exactly what makes this field exciting.

In the end, if there's one thing (of many) that humanity can be proud of, it's definitely the fact that we taught our computers to distinguish cats from dogs!