What are Support Vector Machines?
Support Vector Machines (SVMs) are a common supervised machine learning algorithm used for both classification and regression problems; however, they are most commonly used for classification, which will be the focus of this article.
Overview
The job of a support vector machine for classification problems is to take labelled data such as the following:

and determine a hyperplane that separates the data:

We can then use this hyperplane to make predictions for which class a new data point belongs to:

Since our new observation lies above the hyperplane, we predict that it belongs to Class A.
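To make this concrete, here is a minimal sketch of the same idea, assuming scikit-learn is available (the library, the data points and the new observation below are not from this article; they are invented purely for illustration):

```python
# A minimal sketch: fit a linear SVM on invented, linearly separable data,
# then classify a new observation by which side of the hyperplane it falls on.
import numpy as np
from sklearn.svm import SVC

# Labelled training data: Class A -> 1, Class B -> -1
X = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0],   # Class A
              [6.0, 1.0], [7.0, 1.5], [6.5, 0.5]])  # Class B
y = np.array([1, 1, 1, -1, -1, -1])

# Fit a linear SVM; it finds the separating hyperplane for us.
clf = SVC(kernel="linear")
clf.fit(X, y)

# Predict the class of a new observation.
new_point = np.array([[3.0, 4.0]])
print(clf.predict(new_point))   # -> [1], i.e. Class A
```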
Here, we will be focusing on support vector machines for linearly separable data. The difference between linearly separable and non-linearly separable data is illustrated below:

Terminology
Before we go on to see the general method of calculating the hyperplane, we must first be aware of some support vector machine terminology.
Support Vectors: The vectors (data points) from each class that are closest to our hyperplane; if these vectors were removed, the position of our hyperplane could change.

Margin: The distance from our hyperplane to the closest vector. In this case the hyperplane is at the mid-point between the two support vectors, so the margin is the same on each side.

Hyperplane (SVM): A decision boundary used to separate and classify data. In 2-dimensional space (2 features) a hyperplane is 1-dimensional, i.e. a line (the case we have above). In 3-dimensional space (3 features) a hyperplane is 2-dimensional, a plane. In general, a hyperplane has n - 1 dimensions, where n is our number of features.
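To make these terms concrete, here is a small sketch, again assuming scikit-learn and the same invented data as before: after fitting a linear SVM we can list its support vectors, read off the fitted hyperplane's coefficients (the weight vector and intercept covered in the next section), and check that every support vector lies at the same distance, the margin, from the hyperplane.

```python
# Inspect the support vectors and margin of a fitted linear SVM (invented data).
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0],
              [6.0, 1.0], [7.0, 1.5], [6.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)               # the points closest to the hyperplane

w, b = clf.coef_[0], clf.intercept_[0]    # hyperplane parameters (linear kernel only)
distances = np.abs(clf.support_vectors_ @ w + b) / np.linalg.norm(w)
print(distances)                          # distance of each support vector: equal on both sides
```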
Calculating the Hyperplane
Take a look at the following three hyperplanes separating our data:

Our optimal hyperplane is the one with the largest margin, which in this case is the hyperplane shown in graph 1 above.
Our hyperplane is given by the following formula:
w · x + b = 0
where
- w is a vector of weights (our parameters)
- b is our bias term or intercept
- x is our input vector; in the above case we have just two features, x₁ and x₂, but we may have hundreds or more.
All data points x that belong to Class A have label y = 1.
All data points x that belong to Class B have label y = -1.
We require that our hyperplane:
- predicts a value greater than or equal to 1 for all data points in Class A (when y = 1).
- predicts a value less than or equal to -1 for all data points in Class B (when y = -1).

Mathematically:
w · x + b ≥ 1 for all points in Class A (y = 1)
w · x + b ≤ -1 for all points in Class B (y = -1)
We can combine the above conditions into one to give:
y(w · x + b) ≥ 1 for all data points x
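As a quick illustration of this combined condition, here is a small sketch; the weight vector, bias and points below are arbitrary values chosen only so the inequality can be evaluated:

```python
# Check whether a labelled point satisfies the combined margin condition
# y * (w . x + b) >= 1. All values here are hypothetical.
import numpy as np

def satisfies_margin(w, b, x, y):
    """Return True if the point (x, y) satisfies y * (w . x + b) >= 1."""
    return y * (np.dot(w, x) + b) >= 1

w = np.array([1.0, -1.0])
b = -0.5

print(satisfies_margin(w, b, np.array([3.0, 0.5]), 1))    # True: well inside the Class A side
print(satisfies_margin(w, b, np.array([0.5, 2.0]), -1))   # True: well inside the Class B side
```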
Keeping this condition in mind, we will now calculate our margin algebraically.
Using the formula for the distance between a point and a line, the distance from each support vector to our hyperplane is:

|w · x + b| / ||w|| = 1 / ||w||

since w · x + b = ±1 for a support vector,
giving a total margin of:
2 / ||w||
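As a quick numeric check of this formula, with an arbitrary example weight vector:

```python
# Numeric check of margin = 2 / ||w|| for an arbitrary example weight vector.
import numpy as np

w = np.array([3.0, 4.0])
margin = 2.0 / np.linalg.norm(w)   # ||w|| = 5, so the margin is 2 / 5
print(margin)                      # 0.4
```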
Remember that the best hyperplane is the one with the largest margin, so to maximise the margin above we must minimise:
||w||
We square this value and halve it to make the problem easier to solve, and are left with the following minimisation problem, subject to the condition identified before:
minimise ½ ||w||² subject to yᵢ(w · xᵢ + b) ≥ 1 for every training example (xᵢ, yᵢ)
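To make the optimisation concrete, here is a sketch that hands this exact problem to a general-purpose constrained optimiser (SciPy's SLSQP, an assumption of this example). Dedicated SVM libraries use specialised solvers instead, and the dataset below is the same invented one as before:

```python
# Solve the hard-margin SVM primal problem directly:
#   minimise (1/2) * ||w||^2  subject to  y_i * (w . x_i + b) >= 1 for all i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0],
              [6.0, 1.0], [7.0, 1.5], [6.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

def objective(params):
    w = params[:-1]                      # params = [w1, w2, b]
    return 0.5 * np.dot(w, w)            # (1/2) * ||w||^2

def margin_constraints(params):
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1           # must be >= 0 for every training point

result = minimize(objective,
                  x0=np.zeros(X.shape[1] + 1),
                  constraints=[{"type": "ineq", "fun": margin_constraints}],
                  method="SLSQP")

w_opt, b_opt = result.x[:-1], result.x[-1]
print(w_opt, b_opt)                      # hyperplane with the largest margin
```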
This minimisation problem is a non-linear programming problem, which can be solved using the Karush-Kuhn-Tucker (KKT) conditions. Applying this method, we obtain the conditions:
w = Σ λᵢ yᵢ xᵢ
Σ λᵢ yᵢ = 0
λᵢ ≥ 0 and λᵢ [yᵢ(w · xᵢ + b) - 1] = 0, for i = 1, …, m
where m is the number of training examples and the λᵢ are our Lagrange multipliers.
Read more about how these conditions are obtained here:
https://www.csd.uwo.ca/~xling/cs860/papers/SVM_Explained.pdf
We use the conditions above to obtain optimal values for w and b given training data x and labels y, which gives us the equation of the hyperplane with the largest margin as required.
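As a final sketch, we can see the first condition (w written as a weighted sum of the support vectors) at work in scikit-learn, again on the invented data: after fitting, SVC stores the products λᵢyᵢ for each support vector in its dual_coef_ attribute, so multiplying them by the support vectors recovers the same w the library reports directly.

```python
# Reconstruct w = sum_i(lambda_i * y_i * x_i) from the fitted dual coefficients.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0],
              [6.0, 1.0], [7.0, 1.5], [6.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear").fit(X, y)

w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds lambda_i * y_i
print(w_from_dual)        # matches the weight vector below
print(clf.coef_)          # w reported directly (linear kernel only)
print(clf.intercept_)     # the bias term b
```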
Summary
- Support Vector Machines are commonly used in classification problems.
- SVMs work by calculating the hyperplane that separates our data with the largest margin. This is done using an optimisation technique called non-linear programming.
- We then use this hyperplane as a decision boundary to classify data into groups.
