Non-linear Support Vector Machines in Python

Code and data used can be found here: Repository 

An explanation of Non-linear support vector machines: Episode 9.3

How to set up your programming environment can be found at the start of Episode 4.3

Objective

Produce a non-linear support vector machine that is able to correctly classify a pulsar star.

Importing and exploring our Data

We have a csv file containing the star data, which we will later split into a training set and a test set.

import warnings
warnings.filterwarnings("ignore")

import pandas as pd

star_data = pd.read_csv(r"D:\ProjectData\pulsar_data.csv")

star_data.head()

Class 0 indicates that the star is not a pulsar star

Class 1 indicates that the star is a pulsar star

We can view the number of rows and columns in our datasets.

star_data.shape

From above we see that our data contains 12528 rows and 9 columns.

Data Pre-processing

Check for missing values

star_data.isnull().sum()

From above we note that some rows contain missing values. Most column names also have a leading space, which we can remove.

We remove rows with missing values using the following code:

star_data.dropna(inplace = True)

inplace = True ensures that once we have removed rows with missing values, the result is saved back into the variable star_data. Without it, star_data would be unaltered. Alternatively, we can create a new variable star_data_full = star_data.dropna() (by default inplace = False), where star_data_full would contain the data with the missing-value rows removed.
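As a minimal illustration of the difference (using a tiny throwaway DataFrame rather than our star data):

import pandas as pd

demo = pd.DataFrame({"a": [1.0, None, 3.0]})

kept = demo.dropna()         # demo is unchanged; kept has 2 rows
demo.dropna(inplace = True)  # demo itself now has 2 rows

print(len(kept), len(demo))  # 2 2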

Setting new column names

star_data.columns

We note that all columns apart from target_class have a leading space. We can remove these spaces by stripping the whitespace from each column name.

star_data.columns = star_data.columns.str.strip()

star_data.columns

Splitting the data and data standardization

X = star_data.drop('target_class', axis = 1)
y = star_data['target_class']

We define X to be our model input, which consists of all columns apart from target_class, and y to be our model output, which is target_class.

We can check the distribution of y using the following code:

y.value_counts()

We note from above that the majority of the target class is 0. Using accuracy as a metric to determine the performance of the SVM may not be a good idea, since a simple model that just predicts 0 for every input would still achieve a high accuracy of 8423/9273 ≈ 0.91 (2 d.p.).
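We can check this baseline directly from the class counts above (a quick sanity check):

counts = y.value_counts()

print(counts[0] / counts.sum())  # accuracy of always predicting 0, about 0.91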

Instead we can look at what is called the F1 score to evaluate our model’s performance.

Data Standardization

Support vector machines can be influenced by the scales of features.

It is better to ensure that each feature follows the same scale and contributes equally to the calculation of the hyperplane.

Here we apply data standardization so that each feature, apart from target_class, has mean 0 and a standard deviation of 1.

from sklearn.preprocessing import StandardScaler

s_scaler = StandardScaler()

X_ss = pd.DataFrame(s_scaler.fit_transform(X),
                    columns = X.columns)
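We can sanity-check the result: each feature's mean should now be approximately 0 and its standard deviation approximately 1 (pandas' .std() uses the sample formula, so the values will be close to, but not exactly, 1):

X_ss.mean().round(2)

X_ss.std().round(2)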

Splitting the data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.25, random_state=42)

We split our data into 75% training data and 25% test data. By default shuffle is set to True, meaning our data is shuffled before being split to avoid bias.
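Because the classes are imbalanced, an optional refinement (not used in the rest of this episode) is to pass stratify = y so that both the training and test sets preserve the original 0/1 proportions:

X_train, X_test, y_train, y_test = train_test_split(
    X_ss, y, test_size=0.25, random_state=42, stratify=y)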

Implementing the SVM

We can start off by looking at the performance of a linear support vector machine.

Linear SVM

from sklearn import svm

clf_linear = svm.SVC(kernel = 'linear', C = 10)

clf_linear.fit(X_train, y_train)

We set the kernel to be linear and the cost parameter to 10. Recall from the previous episode that increasing the cost puts a larger penalty on misclassified points and may result in overfitting, since our model becomes overly complex in trying to classify every training example correctly.

We evaluate our model using the F1 score, which is given as the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)

The F1 score takes a value between 0 and 1, with a best score of 1 and a worst score of 0.
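As a quick illustration of the formula, we can compute F1 by hand from precision and recall on a small set of toy labels (made up purely for this example) and compare it with scikit-learn's f1_score:

from sklearn.metrics import precision_score, recall_score, f1_score

# toy labels, purely illustrative
y_true = [0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 1, 1, 1, 0, 0, 1]

p = precision_score(y_true, y_hat)
r = recall_score(y_true, y_hat)

print(2 * p * r / (p + r))      # F1 computed by hand: 0.75
print(f1_score(y_true, y_hat))  # same value from scikit-learn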

I hope to cover more about model evaluation methods in the next episode.

from sklearn.metrics import f1_score

y_pred = clf_linear.predict(X_test)

f1_score(y_test, y_pred)

From above we obtain a high F1 score of 0.872. Perhaps we can perform better using the radial basis function kernel instead of the linear kernel.

Non-linear SVM

clf_rbf = svm.SVC(kernel = 'rbf', C = 10)

clf_rbf.fit(X_train, y_train)


Here we set kernel = 'rbf' (the Gaussian radial basis function kernel), which enables us to fit a non-linear support vector machine.
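For reference, the RBF kernel measures the similarity between two points x and x′ as K(x, x′) = exp(−γ‖x − x′‖²), so points far apart contribute very little to each other's classification. The parameter gamma controls how quickly this similarity decays with distance; since we do not set it here, scikit-learn uses its default gamma = 'scale'.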

y_pred = clf_rbf.predict(X_test)

f1_score(y_test, y_pred)

We note from above that using the rbf kernel has improved the performance of our model.

We can now check the performance of our SVM with different cost parameters:

costs = [0.1, 1, 10, 100, 1000]
for cost in costs:
    clf_rbf = svm.SVC(kernel = 'rbf', C = cost)
    clf_rbf.fit(X_train, y_train)
    y_pred = clf_rbf.predict(X_test)
    score = f1_score(y_test, y_pred)
    print(f"Cost: {cost} F1 score: {score}")

From above, we see that a cost of 10 yields the highest F1 score out of the costs in the list.
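One caveat: choosing C by comparing test-set scores risks tuning the model to this particular split. A more careful alternative (sketched below with the same training data) is a cross-validated grid search that selects C using F1 scores computed on the training folds only:

from sklearn.model_selection import GridSearchCV

# try the same cost values, scoring each with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, scoring='f1', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)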

If you have any questions please leave them below!
