Support vector machines (SVMs) are supervised machine learning models that analyze data for classification and regression analysis. Suppose the data is split into two classes; our goal is to decide whether a new point belongs to class A or class B. Using a linear classifier, we take a data set in $\mathbb{R}^p$ and separate the data with a $(p-1)$-dimensional hyperplane.


There may be several hyperplanes that separate the data. We choose the hyperplane that maximizes the separation of the data. Such a hyperplane is called the maximum-margin hyperplane.

It is often the case that the data set is not linearly separable. To circumvent this issue, we map the data into a higher-dimensional space where the data is separable. Such a mapping is defined so that dot products of vectors may be computed easily in terms of the original space; we do this by defining them via a kernel function $k(x, y)$. In the higher-dimensional space, the hyperplane is the set of points whose dot product with a fixed vector $v$ in that space is constant. We can view $v$ as a linear combination of images of feature vectors $x_i$, i.e., $v = \sum_i \alpha_i \varphi(x_i)$. Under this, the points $x$ in the feature space that get mapped into the hyperplane are given by the relation

$$\sum_i \alpha_i \, k(x_i, x) = \text{constant}.$$

The sum of kernels above can be used to measure the relative nearness of a test point to the data points originating in one or the other of the two sets to be discriminated.

Linear SVM

Given a training dataset of $n$ points $(x_1, y_1), \ldots, (x_n, y_n)$, where $y_i \in \{-1, 1\}$ indicates the class to which $x_i$ belongs and each $x_i \in \mathbb{R}^p$, our goal is to find the maximum-margin hyperplane that divides the points into two groups ($y_i = 1$ and $y_i = -1$) such that the distance from the hyperplane to the nearest point of either group is maximized.

Recall that any hyperplane can be written in the form

$$w^T x - b = 0,$$

where $w$ is the normal vector to the hyperplane. The term $\tfrac{b}{\|w\|}$ determines the offset of the hyperplane from the origin along $w$.

[Figure: the two classes separated by the maximum-margin hyperplane, with the two parallel margin hyperplanes shown]

In the diagram above we separate the data with two parallel hyperplanes:

$$w^T x - b = 1$$

and

$$w^T x - b = -1.$$

Since the distance between these hyperplanes is given by $\frac{2}{\|w\|}$, we want to minimize $\|w\|$ in order to maximize the distance. We also have the additional constraints that

$$w^T x_i - b \ge 1 \quad \text{if } y_i = 1$$

and

$$w^T x_i - b \le -1 \quad \text{if } y_i = -1.$$

Multiplying both sides by $y_i$ combines these two constraints into one: $y_i(w^T x_i - b) \ge 1$. In summary, we get the following optimization problem:

$$\min_{w,\, b} \|w\| \quad \text{subject to } y_i(w^T x_i - b) \ge 1 \text{ for all } 1 \le i \le n.$$

Once we have solved for $w$ and $b$, we get the classifier function $x \mapsto \operatorname{sgn}(w^T x - b)$.
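
As a minimal sketch in NumPy, assuming the optimization above has already produced a weight vector $w$ and offset $b$ (the values below are placeholders for illustration), the classifier is just the sign of $w^T x - b$:

import numpy as np

def linear_svm_classifier(w, b):
    """Return the classifier x -> sgn(w.x - b) for a solved hard-margin SVM."""
    def classify(x):
        return np.sign(np.dot(w, x) - b)
    return classify

# Hypothetical solution values, for illustration only
w = np.array([1.0, -1.0])
b = 0.0
classify = linear_svm_classifier(w, b)
print(classify(np.array([2.0, 0.5])))   #  1.0 (class +1)
print(classify(np.array([-1.0, 3.0])))  # -1.0 (class -1)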

If the data is not linearly separable, we use soft margins by making use of the hinge-loss function

$$\max\bigl(0,\, 1 - y_i(w^T x_i - b)\bigr),$$

which is zero when $x_i$ lies on the correct side of the margin. Then the goal becomes to minimize

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(w^T x_i - b)\bigr),$$

where $C$ is a parameter controlling the trade-off between increasing the margin size and ensuring that the points lie on the correct sides of their margins. As an optimization problem:

$$\min_{w,\, b,\, \zeta} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to } y_i(w^T x_i - b) \ge 1 - \zeta_i \text{ and } \zeta_i \ge 0 \text{ for all } 1 \le i \le n.$$

For large enough $C$, this will behave similarly to the hard-margin SVM.
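
A minimal NumPy sketch of the soft-margin objective above; the data points, labels, and the values of w, b, and C are placeholders chosen only for illustration:

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i - b))."""
    margins = y * (X @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

# Toy data: two points per class
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y, C=1.0))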

Nonlinear SVM

We can create nonlinear classifiers by applying the “kernel trick” where we replace dot products with kernels. This allows an algorithm to fit the maximum-margin hyperplane to a transformed feature space.


Note that working in a higher-dimensional feature space can increase the generalization error, i.e., the error the model makes when predicting outcomes for previously unseen data.

Some common kernels:

  • Polynomial (homogeneous): $k(x_i, x_j) = (x_i \cdot x_j)^d$
  • Polynomial (inhomogeneous): $k(x_i, x_j) = (x_i \cdot x_j + r)^d$
  • Gaussian radial basis function: $k(x_i, x_j) = \exp\bigl(-\gamma \|x_i - x_j\|^2\bigr)$ for $\gamma > 0$. Often $\gamma = \frac{1}{2\sigma^2}$.
  • Sigmoid function: $k(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + c)$ for some $\kappa > 0$ and $c < 0$.
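
As a sketch, the kernels listed above can be written in NumPy as follows; the default parameter values for d, r, gamma, kappa, and c are arbitrary placeholders:

import numpy as np

def polynomial_homogeneous(x, z, d=2):
    return np.dot(x, z) ** d

def polynomial_inhomogeneous(x, z, d=2, r=1.0):
    return (np.dot(x, z) + r) ** d

def gaussian_rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, kappa=1.0, c=-1.0):
    return np.tanh(kappa * np.dot(x, z) + c)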

The kernel $k$ is related to the transform $\varphi$ by the equation $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$. We can represent the classification vector $w$ in the transformed space by

$$w = \sum_i \alpha_i y_i \varphi(x_i).$$

Thus, for classification we can take the dot product of a new point $x$ with $w$ by using the kernel trick, i.e.,

$$w \cdot \varphi(x) = \sum_i \alpha_i y_i \, k(x_i, x).$$
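
For example, for the homogeneous polynomial kernel of degree 2 on $\mathbb{R}^2$, one explicit feature map is $\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The sketch below (function names are illustrative) checks that $k(x, z) = \varphi(x) \cdot \varphi(z)$:

import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Degree-2 homogeneous polynomial kernel."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(k(x, z), np.dot(phi(x), phi(z)))  # both equal 1.0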

The Kernel Trick

For nonlinear SVM we mapped our data set into a higher-dimensional space where it could be linearly separated. The kernel trick is a method that lets a learning algorithm learn a nonlinear function without explicitly mapping the data into the higher-dimensional space. We want our kernel $k$, which acts in the original space $\mathcal{X}$, to act as an inner product in the transformed space $\mathcal{V}$. That is, given our transformation $\varphi : \mathcal{X} \to \mathcal{V}$, our kernel should satisfy

$$k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j).$$

According to Mercer's theorem, such a function $\varphi$ exists if the space $\mathcal{X}$ is equipped with a suitable measure such that $k$ satisfies

$$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, z)\, g(x)\, g(z)\, dx\, dz \ge 0$$

for all square-integrable functions $g$. If we choose the counting measure (cardinality) as our measure, then this condition reduces to

$$\sum_{i=1}^{n} \sum_{j=1}^{n} k(x_i, x_j)\, c_i c_j \ge 0$$

for all finite sequences of points $x_1, \ldots, x_n$ in $\mathcal{X}$ and all real-valued coefficients $c_1, \ldots, c_n$.
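
A minimal sketch of this finite (counting-measure) condition: build the Gram matrix $K_{ij} = k(x_i, x_j)$ for a handful of sample points and check that all of its eigenvalues are nonnegative (up to numerical error). The choice of the Gaussian RBF kernel and the sample points here is purely illustrative:

import numpy as np

def gram_matrix(kernel, points):
    """K[i, j] = kernel(x_i, x_j) for every pair of points."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K

rbf = lambda x, z, gamma=0.5: np.exp(-gamma * np.sum((x - z) ** 2))
points = [np.array(p) for p in [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [-1.0, 3.0]]]
K = gram_matrix(rbf, points)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # True: K is positive semidefinite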

Recall that the classification vector $w$ in the transformed space is given by

$$w = \sum_i \alpha_i y_i \varphi(x_i).$$

We can obtain the coefficients $\alpha_i$ by solving the following (dual) optimization problem:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i \alpha_i \, k(x_i, x_j) \, y_j \alpha_j \quad \text{subject to } \sum_{i=1}^{n} \alpha_i y_i = 0 \text{ and } 0 \le \alpha_i \le C \text{ for all } i.$$

This problem can be solved with quadratic programming. We then find an index $i$ such that $0 < \alpha_i < C$ and $\varphi(x_i)$ lies on the boundary of the margin in the transformed space. Then we solve

$$b = w^T \varphi(x_i) - y_i = \left[\sum_{j=1}^{n} \alpha_j y_j \, k(x_j, x_i)\right] - y_i.$$

This leaves us with the classifier function

$$z \mapsto \operatorname{sgn}\!\left(w^T \varphi(z) - b\right) = \operatorname{sgn}\!\left(\left[\sum_{i=1}^{n} \alpha_i y_i \, k(x_i, z)\right] - b\right).$$
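
A minimal sketch of these final steps, assuming the coefficients alpha have already been obtained from a quadratic-programming solver (the function names here are illustrative):

import numpy as np

def offset_from_support_vector(alpha, y, X, kernel, i):
    """b = sum_j alpha_j y_j k(x_j, x_i) - y_i for a margin support vector x_i."""
    return sum(a * yj * kernel(xj, X[i]) for a, yj, xj in zip(alpha, y, X)) - y[i]

def kernel_svm_classifier(alpha, y, X, kernel, b):
    """Classifier z -> sgn(sum_i alpha_i y_i k(x_i, z) - b)."""
    def classify(z):
        s = sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha, y, X))
        return np.sign(s - b)
    return classify

In practice only the points with nonzero coefficients, the support vectors, contribute to these sums.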

Scikit-Learn

The Python package scikit-learn has support for SVMs. Documentation is available at https://scikit-learn.org/stable/modules/svm.html.

Scikit-learn has three classes for binary and multi-class classification: SVC, NuSVC, and LinearSVC. All three take in two arrays: an array X of shape (n_samples, n_features) holding the training samples, and a vector y of class labels with length n_samples.

from sklearn import svm

# Training data: two samples with class labels 0 and 1
X = [[0, 0], [1, 1]]
y = [0, 1]

# Fit the support vector classifier
clf = svm.SVC()
clf.fit(X, y)

# Predict the class of a new point
clf.predict([[2., 2.]])

The 2d-array X represents the set of points in our data set, whereas the vector y gives the classification of each point. For example, in relation to paragonimiasis, X could hold a 2d-vector for each patient (e.g., weight and height) and y would indicate whether the patient is ELISA positive or negative. Then we could separate that data using SVC.fit.
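
A hypothetical version of that setup (the weights, heights, and ELISA labels below are made up purely for illustration):

from sklearn import svm

# Each row is one patient: [weight (kg), height (cm)] -- made-up values
X = [[52.0, 160.0], [48.0, 155.0], [70.0, 172.0], [65.0, 168.0]]
# 1 = ELISA positive, 0 = ELISA negative -- made-up labels
y = [1, 1, 0, 0]

clf = svm.SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[60.0, 165.0]]))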

Below is some sample code:

import matplotlib.pyplot as plt
 
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.inspection import DecisionBoundaryDisplay
 
# we create 40 separable points
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
 
# fit the model, don't regularize for illustration purposes
clf = svm.SVC(kernel="linear", C=1000)
clf.fit(X, y)
 
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
 
# plot the decision function
ax = plt.gca()
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    plot_method="contour",
    colors="k",
    levels=[-1, 0, 1],
    alpha=0.5,
    linestyles=["--", "-", "--"],
    ax=ax,
)
# plot support vectors
ax.scatter(
    clf.support_vectors_[:, 0],
    clf.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)
plt.show()

which produces a plot of the data points, the maximum-margin decision boundary, the two margins (dashed), and the support vectors circled in black.