#### The “hello-world” of machine learning


Recently I was pondering what could be the most basic introduction to machine learning. I was after a simple task, such as binary classification, and an algorithm sufficiently simple that it could be built from scratch and explained within a short article. If the algorithm had some history, even better. It did not take much time to find the candidate: the perceptron. The perceptron takes us back to the beginnings of machine learning; it was introduced by Frank Rosenblatt more than 60 years ago. Like a neuron, the perceptron takes multiple input features and fits weights that, when multiplied with the input feature vector, allow deciding whether the neuron outputs a signal or not, or, in a machine learning classification context, whether the output is 0 or 1. The perceptron is likely the simplest binary classifier and I am not aware of any practical machine learning use cases that could be tackled with it these days. However, it has significant educational and historical value as it paved the way to neural networks.

The purpose of this article is to introduce the perceptron and use it in a simple binary classification task. The perceptron has been implemented in scikit-learn, but instead of relying on that implementation we will build it from scratch. We will also create a set of visualisations to understand how the algorithm sets its decision boundary and to peek into its convergence. The perceptron is a linear model comprising weights and a bias term that are simultaneously and iteratively adjusted during the fitting process. However, it does not have a continuous loss function like its perhaps immediate successor in the history of machine learning, the adaptive linear neuron (Adaline), which is also a single-layer neural network. Fitting the perceptron simply relies on detecting misclassified samples: the weights and bias are updated as soon as a misclassified sample occurs, rather than once per epoch (an epoch being a full pass through the training set). Hence, the algorithm does not even need an optimiser. I would dare to say that it is so simple and elegant that it becomes beautiful. If you are curious to see how it works, stay tuned!

#### Perceptron theory

The perceptron, like other linear models, uses a set of weights, one for each of the features. To generate a prediction it computes the dot product of the weights and the feature values and adds a bias

z = w₁x₁ + w₂x₂ + … + wₙxₙ + b = wᵀx + b

The result of this linear function, also known as the net input, is fed into an activation function f(z) that in the case of the perceptron is a simple step function, i.e. f(z) takes the value of 1 if z ≥ 0 and 0 otherwise. The role of the activation function is to map the net input to two values, namely 0 and 1. Essentially what we have done is nothing more than defining a hyperplane. Points on the same side of the hyperplane belong to the same class. The weights define the vector normal (perpendicular) to the hyperplane, i.e. the orientation of the hyperplane, and the bias determines the distance of the hyperplane from the origin (the signed distance being −b/‖w‖). When the fitting process starts we have a randomly orientated hyperplane at a random distance from the origin. Whenever we encounter a misclassified sample we nudge the hyperplane a little, changing its orientation and position so that in the next epoch the sample is more likely to be on the correct side of the hyperplane. We can decide how much to nudge the hyperplane, i.e. what the learning rate should be.

Usually we need to pass through all samples a few times (epochs) until no point is misclassified, or to be more precise until no more progress can be made. In every epoch we loop over all samples i = 1,…, nₛₐₘₚₗₑₛ in the training set and, using the current weights and bias, we examine if the model misclassifies; if it does, we update all weights j = 1,…, nfₑₐₜᵤᵣₑₛ using a learning rate η:

wⱼ ← wⱼ + Δwⱼ

where

Δwⱼ = η (yᵢ − ŷᵢ) xᵢⱼ

with the hat indicating the prediction output. We also update the bias with

b ← b + Δb

where

Δb = η (yᵢ − ŷᵢ)
It is conceptually easy to see why we do these operations. Assume that the model predicts class 0, whilst the correct one is 1. If xⱼ is positive, then the weight will be increased so that the net input increases. If xⱼ is negative, then the weight will be reduced so that the net input once more increases (regardless of the sign of the weight). Similarly the bias will be increased leading to further increases of the net input. With these changes it is more likely to predict the correct class for the misclassified sample in the next epoch. The logic is similar when the model predicts class 1, whilst the correct one is 0, with the only difference that all signs are inverted.

Note that the weights and bias may be updated multiple times within the same epoch, once for every misclassified sample. Every misclassification reorients and repositions the decision boundary hyperplane so that the sample is more likely to be correctly predicted in the next epoch.
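The update for a single misclassified sample can be sketched in a few lines of Python. This is a minimal illustration with made-up weights and a made-up sample, not the full implementation:

```python
import numpy as np

def step(z):
    """Step activation: 1 if the net input is non-negative, else 0."""
    return 1 if z >= 0 else 0

# Hypothetical current state of the model and one training sample
eta = 0.1                       # learning rate
w = np.array([-0.5, 0.2])       # current weights
b = 0.0                         # current bias
x = np.array([1.0, 2.0])        # feature vector of sample i
y = 1                           # true label of sample i

y_hat = step(np.dot(w, x) + b)  # prediction with the current model
if y_hat != y:                  # update only on a misclassification
    w += eta * (y - y_hat) * x
    b += eta * (y - y_hat)
```

With these numbers the net input is initially −0.1, so the model predicts 0 while the true label is 1; after the update the weights become (−0.4, 0.4) and the bias 0.1, raising the net input for this sample to 0.5, so it is now on the correct side of the boundary.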

#### Data preparation

We will be using a synthetic data set comprising two Gaussian distributions. The perceptron can be used with features of any dimension but for the purposes of this article we will limit ourselves to two dimensions to facilitate visualisation.
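Such a data set can be generated along the following lines. This is a sketch: the means, covariances and sample counts shown here are illustrative assumptions, not necessarily the values used in the article.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(10)

# Two elongated, well-separated Gaussian clouds (illustrative parameters)
mean0, cov0 = [-2.0, -2.0], [[2.0, 0.8], [0.8, 0.5]]
mean1, cov1 = [3.0, 3.0], [[2.0, 0.8], [0.8, 0.5]]

X0 = rng.multivariate_normal(mean0, cov0, size=100)
X1 = rng.multivariate_normal(mean1, cov1, size=100)

X = np.vstack([X0, X1])                       # (200, 2) feature matrix
y = np.hstack([np.zeros(100), np.ones(100)])  # labels: 100 zeros, 100 ones

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel("x1")
plt.ylabel("x2")
```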

https://medium.com/media/e4813a171486a5f476cfa64fb8fe823d/href

that produces the following figure

Scatterplot of the two classes in the synthetic dataset. Image by the Author.

The two Gaussian distributions are elongated and put further apart on purpose by choosing appropriate means and covariances. We will come back to this later.

#### Perceptron implementation and use

The implementation of the Perceptron can be found below. We follow the scikit-learn style to initialise the model, fit it and finally run a prediction

https://medium.com/media/604f172cec3a5fbe7a9d505bfd6f35e3/href

The initialisation method sets the learning rate, the maximum number of iterations and the random number generator seed for reproducibility purposes. The fit method creates a random number generator that is then used to set the weights to some small numbers sampled from a uniform distribution, whilst the bias is initialised to zero. We then iterate for a maximum number of epochs. For every epoch we count the number of misclassifications so that we can monitor convergence and terminate early if possible. For every sample that is misclassified we update the weights and bias as described in the previous section. If the number of misclassifications is zero then no further improvements can be made and hence there is no need to continue with the next epoch. The predict method simply computes the dot product of the weights and feature values, adds the bias and applies the step function.
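Putting this description together, a from-scratch sketch in the scikit-learn style might look as follows. The parameter and attribute names here are my own choices, not necessarily those of the original code:

```python
import numpy as np

class Perceptron:
    """Rosenblatt perceptron: a linear binary classifier with a step activation."""

    def __init__(self, eta=0.01, n_epochs=100, random_state=None):
        self.eta = eta                  # learning rate
        self.n_epochs = n_epochs        # maximum number of passes over the data
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        # Small random weights and a zero bias to start with
        self.w_ = rng.uniform(-0.01, 0.01, size=X.shape[1])
        self.b_ = 0.0
        self.errors_ = []               # misclassifications per epoch, for monitoring
        for _ in range(self.n_epochs):
            errors = 0
            for xi, yi in zip(X, y):
                update = self.eta * (yi - self.predict(xi))
                if update != 0.0:       # sample was misclassified
                    self.w_ += update * xi
                    self.b_ += update
                    errors += 1
            self.errors_.append(errors)
            if errors == 0:             # converged: no further progress possible
                break
        return self

    def predict(self, X):
        # Net input followed by the step activation
        return np.where(np.dot(X, self.w_) + self.b_ >= 0.0, 1, 0)

# Usage on a tiny linearly separable set
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
model = Perceptron(eta=0.1, n_epochs=50, random_state=0).fit(X, y)
```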

If we use the above Perceptron class with the synthetic data set

https://medium.com/media/2e39136686175bfb69b322b9677b2986/href

we can see that convergence was achieved in 24 epochs, i.e. it was not necessary to exhaust the maximum number of epochs specified.

Perceptron convergence. Image by the Author.

The decision boundary can be readily visualised using the decision boundary utility function from scikit-learn. In order to use this function we generate a grid of 200×200 points spanning the range of feature values in the training set. Essentially we construct a contour plot with the predicted class and overlay the samples as a scatterplot coloured using the true labels. This way of plotting the decision boundary is quite generic and can be useful with any two dimensional classifier.

https://medium.com/media/24e9dcd6e83c62471c84777554bb07d5/href
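The grid-and-contour recipe described above can also be sketched without the scikit-learn utility, which makes the genericity explicit. In this sketch a hypothetical already-fitted linear classifier (made-up weights and samples) stands in for the trained perceptron:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical fitted parameters standing in for the trained perceptron
w, b = np.array([1.0, 1.0]), -1.0
predict = lambda P: (P @ w + b >= 0).astype(int)

# Illustrative samples, coloured below by their true labels
X = np.array([[-1.0, -1.0], [0.0, -0.5], [1.5, 1.0], [2.0, 2.0]])
y = np.array([0, 0, 1, 1])

# 200x200 grid spanning the range of the feature values
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]  # (200*200, 2) query points
zz = predict(grid).reshape(xx.shape)  # predicted class per grid point

plt.contourf(xx, yy, zz, alpha=0.3)   # filled contour of the predictions
plt.scatter(X[:, 0], X[:, 1], c=y)    # samples overlaid as a scatterplot
```

Any object with a two-dimensional predict method can be dropped into this recipe, which is what makes the approach generic.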

The two synthetically constructed Gaussian distributions have been perfectly separated using a model that could be coded from scratch in a few lines of code. The simplicity and elegance of this method makes it a brilliant introductory and motivational example for machine learning.

Decision boundary of the fitted perceptron model. Image by the Author.

We can also visualise the evolution of the decision boundary in the different epochs by stopping the model fitting process prematurely. This can be practically done by fitting the model using an increasing cutoff for the maximum number of epochs. For every attempt we use the weights and bias of the fitted (possibly non-converged) model and plot the decision boundary as a line. The lines are annotated using the epoch number. This could have been implemented more elegantly using a warm start, but fitting the model is very quick and hence it is not worth the additional complexity.

https://medium.com/media/74727a7364519b04b8919990e793a5cd/href
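Drawing the boundary as a line relies on the fact that, in two dimensions, the hyperplane w₁x₁ + w₂x₂ + b = 0 can be solved for x₂. A minimal sketch, with hypothetical weight/bias snapshots standing in for the non-converged fits:

```python
import numpy as np

def boundary_line(w, b, x1):
    """Solve w[0]*x1 + w[1]*x2 + b = 0 for x2 (assumes w[1] != 0)."""
    return -(w[0] * x1 + b) / w[1]

# Hypothetical snapshots of (weights, bias) from successive premature fits
snapshots = {1: (np.array([0.2, 0.5]), -0.1), 5: (np.array([0.4, 0.6]), -0.3)}

x1 = np.linspace(-5, 5, 100)
for epoch, (w, b) in snapshots.items():
    x2 = boundary_line(w, b, x1)  # line to plot and annotate with the epoch number
```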

The decision boundary evolution for the various epochs is shown in the figure below. Initially, a small number of class 0 samples is misclassified, which causes gradual changes to the slope and intercept of the decision boundary line. We can see that convergence was achieved in 24 epochs, which is consistent with the convergence figure above. The fitting stops once the decision boundary achieves perfect separation of the classes, regardless of how close the boundary is to the samples in its vicinity.

A few notes of caution. Perceptron convergence cannot be taken for granted and for this reason it is important to set a maximum number of iterations. In fact, it can be proven mathematically that convergence is guaranteed for linearly separable classes. If the classes are not linearly separable then the weights and bias will keep being updated until the maximum number of iterations is reached. This is why the two Gaussians were moved further apart in the synthetic dataset.

Another important note is that the perceptron has no unique solution. Typically there will be an infinite number of hyperplanes that can separate linearly separable classes and the model will randomly converge to one of them. This also means that measuring the distance from the decision boundary is not deterministic and hence not so useful. Support-vector machines address this limitation.

The perceptron is essentially a single layer neural network. Understanding how it works is helpful before jumping to multi-layered neural networks and the backpropagation algorithm that can be used for non-linear problems.

Classification With Rosenblatt’s Perceptron was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.