Simple Support Vector Machine
First we will import numpy to handle the linear algebra operations in Python. To plot the learning progress later on, we will use matplotlib.
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
Stochastic Gradient Descent
The SVM will learn using the stochastic gradient descent (SGD) algorithm. Gradient descent minimizes a function by repeatedly stepping in the direction of the negative gradient of the cost function; the stochastic variant does so using one sample at a time.
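In its generic textbook form, with learning rate $\eta$ and cost function $C$, a single gradient descent step looks like this; in the stochastic variant, $C$ is evaluated on a single sample instead of the whole data set. The SVM-specific update is derived below.
$$ w \leftarrow w - \eta \, \nabla_w C(w) $$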
Calculating the Error
To calculate the error of a prediction, we first need to define the objective function of the SVM.
Hinge Loss Function
To do this, we first need the loss function that measures the prediction error. We will use the hinge loss for our SVM:
$$c(x, y, f(x)) = \big(1 - y \cdot f(x)\big)_+$$
$c$ is the loss function, $x$ the sample, $y$ is the true label, $f(x)$ the predicted label.
This means the following:
$$
c(x, y, f(x))=
\begin{cases}
0, & \text{if } y f(x) \geq 1 \\
1 - y f(x), & \text{else}
\end{cases}
$$
So consider what happens if $y$ and $f(x)$ are signed values in $\{+1,-1\}$: if both have the same sign and $y f(x) \geq 1$, the sample is classified correctly with a sufficient margin and the loss is $0$; if they have opposite signs (a misclassification), or the margin is too small, the loss grows with $1 - y f(x)$.
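A minimal sketch of this in code (the helper name hinge_loss is my own and not part of the implementation further below):

def hinge_loss(y, fx):
    # Zero loss if the sample is classified correctly with a margin of at least 1,
    # otherwise the loss grows linearly with the violation.
    return max(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.5))   # correctly classified with a large margin -> 0.0
print(hinge_loss(+1, 0.3))   # correct sign, but inside the margin -> 0.7
print(hinge_loss(-1, 0.3))   # misclassified -> 1.3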
Objective Function
Having defined the loss function, we can now define the objective function for the SVM:
$$\min_{w}\ \lambda \lVert w \rVert^2 + \sum_{i=1}^n \big(1 - y_i \langle x_i, w \rangle\big)_+$$
As you can see, the objective of the SVM consists of two terms. The first term is a regularizer, the second term is the loss. The regularizer balances margin maximization against the loss. For more information, I recommend the tutorial introduction of the Schölkopf & Smola book mentioned above.
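To make the two terms concrete, here is a small sketch that evaluates this objective for a given weight vector (the function name svm_objective and the explicit lam parameter are my additions for illustration):

import numpy as np

def svm_objective(w, X, y, lam):
    # Regularizer: lambda * ||w||^2
    reg = lam * np.dot(w, w)
    # Hinge loss summed over all samples: (1 - y_i <x_i, w>)_+
    margins = 1 - y * X.dot(w)
    loss = np.sum(np.maximum(0, margins))
    return reg + loss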
Derive the Objective Function
To minimize this function, we need the gradients of this function.
As we have two terms, we will differentiate them separately using the sum rule.
$$ \frac{\partial}{\partial w_k} \lambda \lVert w \rVert^2 = 2 \lambda w_k $$
$$
\frac{\partial}{\partial w_k} \big(1 - y_i \langle x_i, w \rangle\big)_+ =
\begin{cases}
0, & \text{if } y_i \langle x_i, w \rangle \geq 1 \\
-y_i x_{ik}, & \text{else}
\end{cases}
$$
This means that if we have a misclassified sample $x_i$, i.e. $y_i \langle x_i,w \rangle < 1$, we update the weight vector $w$ using the gradients of both terms; if $y_i \langle x_i,w \rangle \geq 1$, we only update $w$ by the gradient of the regularizer. To sum it up, our stochastic gradient descent update for the SVM looks like this:
if $y_i \langle x_i,w \rangle < 1$: $$ w = w + \eta \,(y_i x_i - 2\lambda w) $$ else: $$ w = w + \eta \,(-2\lambda w) $$
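A single update step for one sample could look like this in code (a small sketch of the rule above; the helper name sgd_step is mine, and the complete implementation follows further below):

import numpy as np

def sgd_step(w, x_i, y_i, eta, lam):
    # Misclassified or inside the margin: use the gradients of both terms
    if y_i * np.dot(x_i, w) < 1:
        return w + eta * (y_i * x_i - 2 * lam * w)
    # Correctly classified with sufficient margin: only the regularizer gradient
    return w + eta * (-2 * lam * w)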
Our Data Set
First we need to define a labeled data set. If you have read the perceptron tutorial, you will already know it.
X = np.array([
    [-2, 4],
    [4, 1],
    [1, 6],
    [2, 4],
    [6, 2]
])

y = np.array([-1, -1, 1, 1, 1])
For simplicity’s sake we again fold the bias term into the data set:
X = np.array([
    [-2, 4, -1],
    [4, 1, -1],
    [1, 6, -1],
    [2, 4, -1],
    [6, 2, -1],
])

y = np.array([-1, -1, 1, 1, 1])
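With this augmentation, the inner product used in the objective already carries a bias term; reading the third weight as the bias $b$ is my interpretation of this construction:
$$\langle x_i, w \rangle = w_1 x_{i1} + w_2 x_{i2} + w_3 \cdot (-1) = w_1 x_{i1} + w_2 x_{i2} - b, \qquad b := w_3$$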
This small toy data set contains two samples labeled with $-1$ and three samples labeled with $+1$. This means we have a binary classification problem, as the data set contains two classes. Let's plot the data set to see that it is linearly separable:
for d, sample in enumerate(X):
    # Plot the negative samples
    if d < 2:
        plt.scatter(sample[0], sample[1], s=120, marker='_', linewidths=2)
    # Plot the positive samples
    else:
        plt.scatter(sample[0], sample[1], s=120, marker='+', linewidths=2)

# Plot a possible hyperplane that separates the two classes.
plt.plot([-2, 6], [6, 0.5])
(Scatter plot of the data set: negative samples marked with '_', positive samples with '+', and a line separating the two classes.)
Let's Start Implementing Stochastic Gradient Descent
Finally we can code our SGD algorithm using our update rules. In contrast to the perceptron's objective function, we use a regularizer in our algorithm. As we have a small data set that is easily linearly separable, the regularizer is actually not needed here, and our stochastic gradient descent algorithm would probably converge faster without it. To give you more powerful code to work with, I will keep it in the following algorithm.
To keep it simple, we will loop over the sample set in order. For larger data sets it makes sense to randomly pick a sample during each iteration of the for-loop; a small sketch of such a variant follows the line-by-line description below.
def svm_sgd(X, Y):
    w = np.zeros(len(X[0]))
    eta = 1
    epochs = 100000

    for epoch in range(1, epochs):
        for i, x in enumerate(X):
            if (Y[i] * np.dot(X[i], w)) < 1:
                w = w + eta * ((X[i] * Y[i]) + (-2 * (1/epoch) * w))
            else:
                w = w + eta * (-2 * (1/epoch) * w)

    return w
We will run the SGD for $100000$ epochs. Our learning rate $\eta$ is set to $1$. As the regularization parameter we choose $1/t$, where $t$ is the current epoch, so this parameter decreases as the number of epochs increases.
Code Description Line by Line
line 2: Initialize the weight vector for the SVM with zeros
line 3: Set the learning rate to 1
line 4: Set the number of epochs
line 6: Iterate epochs times over the whole data set. The iterator begins at $1$ to avoid a division by zero when calculating the regularization parameter
line 7: Iterate over each sample in the data set
line 8: Misclassification condition $y_i \langle x_i,w \rangle < 1$
line 9: Update rule for the weights $w = w + \eta (y_i x_i - 2\lambda w)$, including the learning rate $\eta$ and the regularizer $\lambda$
line 11: If classified correctly, just update the weight vector by the derived regularizer term $w = w + \eta (-2\lambda w)$
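As noted above, for larger data sets it can make sense to visit the samples in a random order in each epoch. A minimal sketch of such a variant, shuffling the indices with np.random.permutation (this shuffled version is my addition and not part of the algorithm above):

def svm_sgd_shuffled(X, Y, eta=1, epochs=100000):
    w = np.zeros(len(X[0]))
    for epoch in range(1, epochs):
        # Visit the samples in a random order in every epoch
        for i in np.random.permutation(len(X)):
            if (Y[i] * np.dot(X[i], w)) < 1:
                w = w + eta * ((X[i] * Y[i]) + (-2 * (1/epoch) * w))
            else:
                w = w + eta * (-2 * (1/epoch) * w)
    return w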
Let the SVM learn!
Next we can execute our code to calculate the weight vector that fits our training data. We also record, for every epoch, whether it contained a misclassified sample, and plot these errors so we can see how the misclassifications decrease over time.
def svm_sgd_plot(X, Y):
    w = np.zeros(len(X[0]))
    eta = 1
    epochs = 100000
    errors = []

    for epoch in range(1, epochs):
        error = 0
        for i, x in enumerate(X):
            if (Y[i] * np.dot(X[i], w)) < 1:
                w = w + eta * ((X[i] * Y[i]) + (-2 * (1/epoch) * w))
                # Remember that this epoch contained a misclassified sample
                error = 1
            else:
                w = w + eta * (-2 * (1/epoch) * w)
        errors.append(error)

    print(w)

    # Plot the misclassification flag per epoch
    plt.plot(errors, '|')
    plt.ylim(0.5, 1.5)
    plt.gca().set_yticklabels([])
    plt.xlabel('Epoch')
    plt.ylabel('Misclassified')
    plt.show()
svm_sgd_plot(X,y)
[ 1.58876117 3.17458055 11.11863105]
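With the learned weight vector we can classify new samples by the sign of the inner product; note that a new sample has to be augmented with the same $-1$ bias feature. A small usage sketch (the weights are the output printed above, the test point is made up for illustration):

# Learned weights, e.g. the output printed above
w = np.array([1.58876117, 3.17458055, 11.11863105])

# A new sample, augmented with the -1 bias feature
x_new = np.array([2, 5, -1])

# Predicted label: sign of <x_new, w>
prediction = np.sign(np.dot(x_new, w))
print(prediction)  # -> 1.0 for this point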
Homework
Complete your own Support Vector Machine!