#### Comes with a Free Deep Overview on SVMs

In this story, we shall implement the support vector machine learning algorithm in its general soft-margin and kernelized form. We will start by providing a brief overview of SVM and its training and inference equations, then subsequently translate these into code to develop the SVM model. Afterwards, we extend our implementation to handle multiclass scenarios and conclude by testing our model using Sci-kit Learn.

Thus, by the end of this story:

You will gain a clear perspective of various important SVM concepts.You will be able to implement, with genuine comprehension, the SVM model from scratch for the binary and multiclass cases.Van Gogh Starry Night with a Line of Symmetry and Stars— Generated by author using DALLE 2

#### Table of Content

· Brief Overview

∘ Hard Margin SVM

∘ Soft Margin SVM

∘ Kernel Soft Margin SVM

· Implementation

∘ Basic Imports

∘ Defining Kernels and SVM Hyperparameters

∘ Define the Predict Method

∘ Define the Predict Method

∘ Test the Implementation

∘ Generalizing Fit to Multiclass

∘ Generalizing Predict to Multiclass

∘ Testing the Implementation

### Brief Overview

#### Hard Margin SVM

The goal in SVM is to fit the hyperplane that would attain the maximum margin (distance from the closest points from in the two classes). It can be shown, and is intuitive that such hyperplane (A) has better generalization properties and is more robust to noise than a hyperplane that doesn’t maximize the margin (B).

Figure by Ennepetaler86 on Wikimedia. CC BY-SA 3.0 Unported.

In order to achieve this, SVM finds the hyperplane’s *W* and b by solving the following optimization problem:

It attempts to find *W,b* that maximizes the distance to the closest point and classifies everything correctly (as in the constraint where y takes ±1). This can be shown to be equivalent to the following optimization problem:

For which one can write the equivalent dual optimization problem

The solution to this yields a Lagrange multiplier for each point in the dataset which we assume to have size *m: (α**₁**, α₂, …, α_N)*. The objective function is clearly quadratic in *α *and the constraints are linear which means that this can be easily solved with quadratic programming. Once the solution found, it follows from the derivation of the dual that:

*(xₛ,yₛ)* is any point with *α>0*

Notice that only points with *α>0 *define the hyperplane (contribute to the sum). Those are called support vectors.

And thereby, the prediction equation, that when given a new example *x, *returns its prediction *y=±1* is:

Involves plugging then doing some algebraic simplification

This basic form of SVM is called hard margin SVM because the optimization problem it solves (as defined above) enforces that all points in training must be classified correctly. In practical scenarios, there may be some noise that prevents or limits the existence of a hyperplane that perfectly separates the data in which case the optimization problem would return no or a poor solution.

#### Soft Margin SVM

Fit by Soft Margin SVM by Mangat et al on Research Gate. CC BY-SA 4.0 International

To generalize hard margin SVM, soft margin SVM adapts the optimization problem by introducing a C constant (user given hyperparamer) that controls how “hard” it should be. In particular, it modifies the primal optimization problem to become the following:

modifications in blue

which allows each point to make some violation ϵₙ (e.g., be in the wrong side of the hyperplane) but still aims to reduce them overall by weighting their sum in the objective function by C. It becomes equivalent to hard margin as C approaches infinity (generally before it does). Meanwhile, a smaller C would allow more violations (in return for a wider margin; i.e., smaller *wᵗw*).

Quite surprisingly, it can be shown that the equivalent dual problem only changes by constraining *α *for each point to be *≤C.*

Since violations are allowed, support vectors (points with *α>0) *are no longer all on the margin’s edge. It can be shown that any support vector that has committed a violation will have *α=C *and that non-support vectors (*α=0) *cannot commit violations. We call support vectors that potentially committed violations (*α=C*) “non-margin support vectors” and other pure ones (that have not committed violations; lie on the edge) “margin support vectors” (0<*α<C*).

It can be shown that the inference equation doesn’t change:

However, now *(xₛ,yₛ) *must now be a support vector that has not committed violations because the equation assumes it’s on the margin’s edge (previously, any support vector could be used).

#### Kernel Soft Margin SVM

Soft Margin SVM extends the Hard Margin SVM to handle noise, but frequently, data is not separable by a hyperplane due to factors beyond noise, such as being naturally nonlinear. Soft Margin SVM can be used in cases like these, but then the optimal solution will probably involve a hyperplane that permits much more errors than is tolerable in reality.

Figure by Machine Learner on Wikimedia. CC BY-SA 4.0 International

Kernel Soft Margin SVM generalizes Soft Margin SVM to deal with situations where the data is naturally nonlinear. For instance, in the example shown on the left there is no linear hyperplane that Soft Margin SVM can find, regardless to the setting of C, that would decently separate the data.

It is possible, however, to map each point *x* in the dataset to a higher dimension via some transform function *z=Φ(x)*** **to make the data more linear (or perfectly linear) in the new higher dimensional space. This is equivalent to replacing

*x*with

*z*in the dual to get:

In reality, especially when *Φ *transforms into a very high-dimensional space, computing *z* can take a very long time. This is solved by the kernel trick which replaces the *z*ᵗ*z *with an **equivalent** computation of a mathematical function (called kernel function) and which is much faster (e.g., an algebraic simplification of *z*ᵗ*z). *For instance, here are some popular kernel functions (each of which corresponds to some transformation *Φ* to a higher dimensional space):

Degree of polynomial (Q) and RBF γ are hyperparameters (decided by the user)

With this, the dual optimization problem becomes:

and intuitively, the inference equation becomes (after algebraic manipulation):

A full derivation of all the equations above assuming that you have the mathematical background can be found here.

Photo by Scott Graham on Unsplash

### Implementation

For the implementation we will use

#### Basic Imports

We start by importing some basic libraries:

import numpy as np # for basic operations over arrays

from scipy.spatial import distance # to compute the Gaussian kernel

import cvxopt # to solve the dual opt. problem

import copy # to copy numpy arrays

#### Defining Kernels and SVM Hyperparameters

We start by defining the three kernels using their respective functions

Degree of polynomial (Q) and RBF γ are hyperparameters (decided by the user)class SVM:

linear = lambda x, xࠤ , c=0: x @ xࠤ.T

polynomial = lambda x, xࠤ , Q=5: (1 + x @ xࠤ.T)**Q

rbf = lambda x, xࠤ, γ=10: np.exp(-γ*distance.cdist(x, xࠤ,’sqeuclidean’))

kernel_funs = {‘linear’: linear, ‘polynomial’: polynomial, ‘rbf’: rbf}

For consistency with other kernels, the linear kernel takes an extra useless hyperparameter. As obvious, kernel_funs takes a string for the kernel and returns the corresponding kernel function.

Now let’s carry on by defining the constructor:

class SVM:

linear = lambda x, xࠤ , c=0: x @ xࠤ.T

polynomial = lambda x, xࠤ , Q=5: (1 + x @ xࠤ.T)**Q

rbf = lambda x, xࠤ, γ=10: np.exp(-γ*distance.cdist(x, xࠤ,’sqeuclidean’))

kernel_funs = {‘linear’: linear, ‘polynomial’: polynomial, ‘rbf’: rbf}

def __init__(self, kernel=’rbf’, C=1, k=2):

# set the hyperparameters

self.kernel_str = kernel

self.kernel = SVM.kernel_funs[kernel]

self.C = C # regularization parameter

self.k = k # kernel parameter

# training data and support vectors (set later)

self.X, y = None, None

self.αs = None

# for multi-class classification (set later)

self.multiclass = False

self.clfs = []

The SVM has three main hyperparameters, the kernel (we store the string given and the corresponding kernel function), the regularization parameter C and the kernel hyperparameter (to be passed to the kernel function); it represents Q for the polynomial kernel and γ for the RBF kernel.

#### Define the Fit Method

To extend this class with fit and predict functions in separate cells, we will define the following function and use it as a decorator later:

SVMClass = lambda func: setattr(SVM, func.__name__, func) or func

Recall that fitting the SVM corresponds to finding the support vector *α *for each point by solving the dual optimization problem:

Let ** α **be a variable column vector

*(α*

*₁**α₂ … α_N)*ᵗ and let y be a constant column vector for the labels

*(y*

*₁**y₂ … y_N)*ᵗ and let

*K*be a constant matrix where

*K[n,m]*computes the kernel at

*(xₙ, xₘ)*. Recall the following index-based equivalences for the dot product, outer product and quadratic form respectively:

to be able to write the dual optimization problem in matrix form as follow:

Knowing that this is a quadratic program as we hinted earlier, we can read over Quadratic Programming in CVXOPT’s documentation:

From CVXOPT Documentation. GNU General Public License

The square brackets tell you that you can call this with *(P,q)* only or *(P,q,G,h)* or *(P, q, G, h, A, b)* and so on (anything not given will be set by a default value such as 1).

To know the values for *(P, q, G, h, A, b) *in our case, we make the following comparison:

Let’s the comparison easier by rewriting the first as follows:

Notice that we changed the max to min by multiplying the function by -1

It’s now obvious that (note that *0≤*** α **is equivalent to

*-α≤0):*

By this, we can write the following fit function:

@SVMClass

def fit(self, X, y, eval_train=False):

# if more than two unique labels, call the multiclass version

if len(np.unique(y)) > 2:

self.multiclass = True

return self.multi_fit(X, y, eval_train)

# if labels given in {0,1} change it to {-1,1}

if set(np.unique(y)) == {0, 1}: y[y == 0] = -1

# ensure y is a Nx1 column vector (needed by CVXOPT)

self.y = y.reshape(-1, 1).astype(np.double) # Has to be a column vector

self.X = X

N = X.shape[0] # Number of points

# compute the kernel over all possible pairs of (x, x’) in the data

# by Numpy’s vectorization this yields the matrix K

self.K = self.kernel(X, X, self.k)

### Set up optimization parameters

# For 1/2 x^T P x + q^T x

P = cvxopt.matrix(self.y @ self.y.T * self.K)

q = cvxopt.matrix(-np.ones((N, 1)))

# For Ax = b

A = cvxopt.matrix(self.y.T)

b = cvxopt.matrix(np.zeros(1))

# For Gx <= h

G = cvxopt.matrix(np.vstack((-np.identity(N),

np.identity(N))))

h = cvxopt.matrix(np.vstack((np.zeros((N,1)),

np.ones((N,1)) * self.C)))

# Solve

cvxopt.solvers.options[‘show_progress’] = False

sol = cvxopt.solvers.qp(P, q, G, h, A, b)

self.αs = np.array(sol[“x”]) # our solution

# a Boolean array that flags points which are support vectors

self.is_sv = ((self.αs-1e-3 > 0)&(self.αs <= self.C)).squeeze()

# an index of some margin support vector

self.margin_sv = np.argmax((0 < self.αs-1e-3)&(self.αs < self.C-1e-3))

if eval_train:

print(f”Finished training with accuracy{self.evaluate(X, y)}”)

We ensure that this is a binary problem and that the binary labels are set as assumed by SVM (±1) and that *y* is a column vector with dimensions *(N,1)*. Then we solve the optimization problem to find *(α**₁** α₂ … α_N)*ᵗ.

We use *(α**₁** α₂ … α_N)*ᵗ to get an array of flags that is 1 at any index corresponding to a support vector so that we can later apply the prediction equation by only summing over support vectors and an index for a margin support vector for *(xₛ,yₛ). *Notice that in the checks we do assume that non-support vectors may not have *α=0* exactly, if it’s *α≤1e-3 *then this is approximately zero (we know CVXOPT results may not be ultimately precise). Likewise, we assume that non-margin support vectors may not have *α=C *exactly.

#### Define the Predict Method

Recall the prediction equation is:

@SVMClass

def predict(self, X_t):

if self.multiclass: return self.multi_predict(X_t)

# compute (xₛ, yₛ)

xₛ, yₛ = self.X[self.margin_sv, np.newaxis], self.y[self.margin_sv]

# find support vectors

αs, y, X= self.αs[self.is_sv], self.y[self.is_sv], self.X[self.is_sv]

# compute the second term

b = yₛ – np.sum(αs * y * self.kernel(X, xₛ, self.k), axis=0)

# compute the score

score = np.sum(αs * y * self.kernel(X, X_t, self.k), axis=0) + b

return np.sign(score).astype(int), score

That’s it. We can also implement an evaluate method to compute the accuracy (used in fit above).

@SVMClass

def evaluate(self, X,y):

outputs, _ = self.predict(X)

accuracy = np.sum(outputs == y) / len(y)

return round(accuracy, 2)

#### Test the Implementation

from sklearn.datasets import make_classification

import numpy as np

# Load the dataset

np.random.seed(1)

X, y = make_classification(n_samples=2500, n_features=5,

n_redundant=0, n_informative=5,

n_classes=2, class_sep=0.3)

# Test Implemented SVM

svm = SVM(kernel=’rbf’, k=1)

svm.fit(X, y, eval_train=True)

y_pred, _ = svm.predict(X)

print(f”Accuracy: {np.sum(y==y_pred)/y.shape[0]}”) #0.9108

# Test with Scikit

from sklearn.svm import SVC

clf = SVC(kernel=’rbf’, C=1, gamma=1)

clf.fit(X, y)

y_pred = clf.predict(X)

print(f”Accuracy: {sum(y==y_pred)/y.shape[0]}”) #0.9108

You can change the dataset and hyperparameter to further ensure that they are the same. Ideally, do so by comparing decision functions rather than accuracy.

#### Generalizing Fit to Multiclass

@SVMClass

def multi_fit(self, X, y, eval_train=False):

self.k = len(np.unique(y)) # number of classes

# for each pair of classes

for i in range(self.k):

# get the data for the pair

Xs, Ys = X, copy.copy(y)

# change the labels to -1 and 1

Ys[Ys!=i], Ys[Ys==i] = -1, +1

# fit the classifier

clf = SVM(kernel=self.kernel_str, C=self.C, k=self.k)

clf.fit(Xs, Ys)

# save the classifier

self.clfs.append(clf)

if eval_train:

print(f”Finished training with accuracy {self.evaluate(X, y)}”)

To generalize the model to multiclass, over *k* classes. We train a binary SVM classifier for each class present where we loop on each class and relabel points belonging to it into +1 and points from all other classes into -1.

The result from training is *k* classifiers when given *k* classes where the *ith* classifier was trained on the data with the *ith* class being labeled as +1 and all others being labeled as -1.

#### Generalizing Predict to Multiclass

Then to perform prediction on a new example, we choose the class for which the corresponding classifier is most confident (has the highest score)

@SVMClass

def multi_predict(self, X):

# get the predictions from all classifiers

N = X.shape[0]

preds = np.zeros((N, self.k))

for i, clf in enumerate(self.clfs):

_, preds[:, i] = clf.predict(X)

# get the argmax and the corresponding score

return np.argmax(preds, axis=1), np.max(preds, axis=1)

#### Testing the Implementation

from sklearn.datasets import make_classification

import numpy as np

# Load the dataset

np.random.seed(1)

X, y = make_classification(n_samples=500, n_features=2,

n_redundant=0, n_informative=2,

n_classes=4, n_clusters_per_class=1,

class_sep=0.3)

# Test SVM

svm = SVM(kernel=’rbf’, k=4)

svm.fit(X, y, eval_train=True)

y_pred = svm.predict(X)

print(f”Accuracy: {np.sum(y==y_pred)/y.shape[0]}”) # 0.65

# Test with Scikit

from sklearn.multiclass import OneVsRestClassifier

from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC(kernel=’rbf’, C=1, gamma=4)).fit(X, y)

y_pred = clf.predict(X)

print(f”Accuracy: {sum(y==y_pred)/y.shape[0]}”) # 0.65

Plotting the decision regions for each further leads to the following plot:

Figure by the author

Beware that, although Sci-kit Learn’s SVM supports OVR by default (without an explicit call as shown above), that potentially also has further optimizations specific to SVM.

#### Full Code

import numpy as np # for basic operations over arrays

from scipy.spatial import distance # to compute the Gaussian kernel

import cvxopt # to solve the dual optimization problem

import copy # to copy numpy arrays

class SVM:

linear = lambda x, xࠤ , c=0: x @ xࠤ .T

polynomial = lambda x, xࠤ , Q=5: (1 + x @ xࠤ.T)**Q

rbf = lambda x, xࠤ , γ=10: np.exp(-γ * distance.cdist(x, xࠤ,’sqeuclidean’))

kernel_funs = {‘linear’: linear, ‘polynomial’: polynomial, ‘rbf’: rbf}

def __init__(self, kernel=’rbf’, C=1, k=2):

# set the hyperparameters

self.kernel_str = kernel

self.kernel = SVM.kernel_funs[kernel]

self.C = C # regularization parameter

self.k = k # kernel parameter

# training data and support vectors

self.X, y = None, None

self.αs = None

# for multi-class classification

self.multiclass = False

self.clfs = []

# This is useless here (only for notebook)

SVMClass = lambda func: setattr(SVM, func.__name__, func) or func

@SVMClass

def fit(self, X, y, eval_train=False):

if len(np.unique(y)) > 2:

self.multiclass = True

return self.multi_fit(X, y, eval_train)

# relabel if needed

if set(np.unique(y)) == {0, 1}: y[y == 0] = -1

# ensure y has dimensions Nx1

self.y = y.reshape(-1, 1).astype(np.double) # Has to be a column vector

self.X = X

N = X.shape[0]

# compute the kernel over all possible pairs of (x, x’) in the data

self.K = self.kernel(X, X, self.k)

# For 1/2 x^T P x + q^T x

P = cvxopt.matrix(self.y @ self.y.T * self.K)

q = cvxopt.matrix(-np.ones((N, 1)))

# For Ax = b

A = cvxopt.matrix(self.y.T)

b = cvxopt.matrix(np.zeros(1))

# For Gx <= h

G = cvxopt.matrix(np.vstack((-np.identity(N),

np.identity(N))))

h = cvxopt.matrix(np.vstack((np.zeros((N,1)),

np.ones((N,1)) * self.C)))

# Solve

cvxopt.solvers.options[‘show_progress’] = False

sol = cvxopt.solvers.qp(P, q, G, h, A, b)

self.αs = np.array(sol[“x”])

# Maps into support vectors

self.is_sv = ((self.αs > 1e-3) & (self.αs <= self.C)).squeeze()

self.margin_sv = np.argmax((1e-3 < self.αs) & (self.αs < self.C – 1e-3))

if eval_train:

print(f”Finished training with accuracy {self.evaluate(X, y)}”)

@SVMClass

def multi_fit(self, X, y, eval_train=False):

self.k = len(np.unique(y)) # number of classes

# for each pair of classes

for i in range(self.k):

# get the data for the pair

Xs, Ys = X, copy.copy(y)

# change the labels to -1 and 1

Ys[Ys!=i], Ys[Ys==i] = -1, +1

# fit the classifier

clf = SVM(kernel=self.kernel_str, C=self.C, k=self.k)

clf.fit(Xs, Ys)

# save the classifier

self.clfs.append(clf)

if eval_train:

print(f”Finished training with accuracy {self.evaluate(X, y)}”)

@SVMClass

def predict(self, X_t):

if self.multiclass: return self.multi_predict(X_t)

xₛ, yₛ = self.X[self.margin_sv, np.newaxis], self.y[self.margin_sv]

αs, y, X= self.αs[self.is_sv], self.y[self.is_sv], self.X[self.is_sv]

b = yₛ – np.sum(αs * y * self.kernel(X, xₛ, self.k), axis=0)

score = np.sum(αs * y * self.kernel(X, X_t, self.k), axis=0) + b

return np.sign(score).astype(int), score

@SVMClass

def multi_predict(self, X):

# get the predictions from all classifiers

preds = np.zeros((X.shape[0], self.k))

for i, clf in enumerate(self.clfs):

_, preds[:, i] = clf.predict(X)

# get the argmax and the corresponding score

return np.argmax(preds, axis=1)

@SVMClass

def evaluate(self, X,y):

outputs, _ = self.predict(X)

accuracy = np.sum(outputs == y) / len(y)

return round(accuracy, 2)

from sklearn.datasets import make_classification

import numpy as np

# Load the dataset

np.random.seed(1)

X, y = make_classification(n_samples=500, n_features=2,

n_redundant=0, n_informative=2, n_classes=4,

n_clusters_per_class=1, class_sep=0.3)

# Test SVM

svm = SVM(kernel=’rbf’, k=4)

svm.fit(X, y, eval_train=True)

y_pred = svm.predict(X)

print(f”Accuracy: {np.sum(y==y_pred)/y.shape[0]}”)

# Test with Scikit

from sklearn.multiclass import OneVsRestClassifier

from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC(kernel=’rbf’, C=1, gamma=4)).fit(X, y)

y_pred = clf.predict(X)

print(f”Accuracy: {sum(y==y_pred)/y.shape[0]}”)

Photo by Nathan Van Egmond on Unsplash

In summary, we implemented the support vector machine (SVM) learning algorithm, covering its general soft-margin and kernelized form. We provided an overview of SVM, developed the model in code, extended it for multiclass scenarios, and validated our implementation using Sci-kit Learn.

Hope you find what you have learnt from this story useful for your work. Till next time, au revoir.

**Resources:**

The code is mostly adaptations over the one present here (MIT License):

Persson, Aladdin. “SVM from Scratch — Machine Learning Python (Support Vector Machine).” YouTube.

Easily Implement Multiclass SVM From Scratch in Python was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Heya i am for the primary time here. I found this board and I in finding It really helpful & it helped

me out a lot. I am hoping to offer something back and aid others

like you aided me.