Code and data used can be found here: Repository
In the previous episode we discussed several methods with which you can evaluate your regression model. At the end of the episode we also discussed a general method which you can apply.
This article is split into two parts:
Part 1 builds a multiple linear regression model to predict fish weight given the vertical length, diagonal length, cross length, height and width of the fish in cm.
Part 2 focuses on evaluating and improving the regression model.
1. Building the Regression Model
The first step is to build the regression model which we can evaluate.
For this we are going to be using fish data from: https://www.kaggle.com/aungpyaeap/fish-market to predict the weight of a fish given multiple variables.
# Import Pandas Library, used for data manipulation
# Import matplotlib, used to plot our data
# Import numpy for mathematical operations
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Import our fish data and store it in the variable fish_data
fish_data = pd.read_csv(r"D:\ProjectData\Fish.csv")
# Display first few rows of data
fish_data.head()

Renaming columns
Here we rename the columns to be more detailed.
# renaming columns
renamed_columns = ['Species', 'Weight', 'Vertical_length', 'Diagonal_length', 'Cross_length', 'Height', 'Width']
fish_data.columns = renamed_columns
# view changes
fish_data.columns

Data pre-processing
To be able to use species as a feature to help predict the fish’s weight we can use one hot encoding. This is explained at 10:40 of project 2 of this series.
# one hot encode species feature
fish_data = pd.get_dummies(fish_data)
# view changes
fish_data.head()

We can see that at the end of our data each species has been one hot encoded.
# view shape of dataframe
fish_data.shape

From above we see that our data frame has 159 rows and 13 columns.
# input data
X = fish_data.drop(['Weight'], axis=1)
# target variable
y = fish_data.Weight
# split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
Model fitting
Next we fit our model to our training data and evaluate its performance.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
2. Evaluating and improving the Regression Model
First we take a look at the model’s performance on the test set. For this we use our model to form predictions from the input data of our test set, X_test.
These predictions are stored under the variable y_pred.
y_pred = model.predict(X_test)
If we see poor performance we may have to adjust the model used, perhaps by adding polynomial terms or interaction terms.
We can check the mean squared error (MSE) in Python using the following:
# mean squared error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

We can check the root mean squared error (RMSE) in Python using the following:
# root mean squared error
mean_squared_error(y_test, y_pred, squared=False)

We use the same function as before but set the squared parameter to equal False.
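Equivalently, if your version of scikit-learn does not support the squared parameter, a minimal alternative is to take the square root of the MSE yourself with numpy (np was imported earlier):

# root mean squared error computed manually as the square root of the MSE
np.sqrt(mean_squared_error(y_test, y_pred))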
We can check the R squared score (R² score) in Python using the following:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

We note quite a high r2_score showing good model performance on the test set.
We can check the adjusted R squared score (R²-adj score) in Python using the following:
# define variables for adjusted r2 score
r2 = r2_score(y_test, y_pred)
n = len(y_test)
k = len(X_test.columns)
# calculate adjusted r2 score
adj_r2_score = 1 - (((1 - r2) * (n - 1)) / (n - k - 1))

k is given as the number of independent variables (the number of input variables in our model), which can be calculated by getting the number of columns of X_test.
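scikit-learn does not provide an adjusted R² function, so if you want to reuse this calculation you could wrap it in a small helper (adjusted_r2 is a name introduced here purely for illustration):

# helper to compute the adjusted R2 score
# r2: R2 score, n: number of observations, k: number of independent variables
def adjusted_r2(r2, n, k):
    return 1 - ((1 - r2) * (n - 1)) / (n - k - 1)

adjusted_r2(r2, n, k)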
Let us now try to make some improvements to our model by adding polynomial terms. This is explained in episode 4.6.
from sklearn.preprocessing import PolynomialFeatures
# transform data to include polynomial terms up to the third degree
poly = PolynomialFeatures(degree=3)
X_degree3 = poly.fit_transform(X)
We can check the number of features of our model by looking at the shape of our transformed data:
# check number of features, this is given as the number of columns of our transformed data
X_degree3.shape

From above we see that our model has 455 features: with 12 input variables, a degree 3 transformation produces every monomial of degree at most 3 (including the bias term), and there are 455 of these.
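As a quick sanity check (using Python's built-in math.comb), the count matches the number of monomials of degree at most 3 that can be formed from 12 variables:

# number of polynomial features of degree <= 3 (including the bias term)
# that can be built from 12 input variables
from math import comb
comb(12 + 3, 3)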
Lastly we split our data and fit our model.
# split data
X_train, X_test, y_train, y_test = train_test_split(X_degree3, y, test_size=0.15, random_state=42)
# fit model
degree3_model = LinearRegression()
degree3_model.fit(X_train, y_train)
# produce set of predictions
y_pred = degree3_model.predict(X_test)
Let us now check the R2 score of our new model.
r2_score(y_test, y_pred)

We see that we have a very low R2 score that is less than 0. This model performs even worse than a model that just predicts the mean value for each observation. This is likely due to overfitting since we added more features to our model.
- A model that overfits our data performs well on the training data but poorly on the test data.
- A model that underfits our data performs poorly on both the training and test data.
Overfitting and underfitting are explained in episode 5.
We have seen that our model performs poorly on the test data; let us now check its performance on the training data:
# produce set of predictions from training data X_train
y_train_pred = degree3_model.predict(X_train)
r2_score(y_train, y_train_pred)

From above we see that our model performs very well on the training data (R² score very close to 1), which indicates that our model is overfitting our data.
By including only interaction terms in our model:

poly = PolynomialFeatures(interaction_only=True)

we reduce our number of features from 455 to 79 (1 bias term, the 12 original features and 66 pairwise interaction terms), as shown in the sketch below.
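To reproduce the new score, a minimal sketch of the interaction-only pipeline (the names X_interact and interaction_model are introduced here for illustration) looks like this:

# transform data to include only interaction terms
X_interact = poly.fit_transform(X)
# split data
X_train, X_test, y_train, y_test = train_test_split(X_interact, y, test_size=0.15, random_state=42)
# fit model
interaction_model = LinearRegression()
interaction_model.fit(X_train, y_train)
# produce set of predictions and score them
y_pred = interaction_model.predict(X_test)
r2_score(y_test, y_pred)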

Our R2 score has increased and is higher than that of our original model with 12 features.
Note that in this example our number of features exceeds our number of test examples (k > n). In this case the adjusted R² score takes a value larger than 1 and is not reliable for evaluating model performance.
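One way to see this numerically (the numbers below are made up for illustration, not taken from our dataset): when k > n - 1 the denominator n - k - 1 becomes negative, which pushes the adjusted R² above 1.

# illustrative values only: more features than test examples
r2 = 0.5
n = 24
k = 79
1 - ((1 - r2) * (n - 1)) / (n - k - 1)  # roughly 1.21, i.e. above 1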
If you have any questions please leave them below!
