Code and data used can be found here: Repository
This episode combines knowledge from all previous episodes to build, evaluate and improve a ridge regression model that makes predictions for weather data from Szeged, Hungary.
Objective
Construct a regression model that makes reasonable predictions for Humidity given the following data:

Our model should take new inputs of Temperature, Wind Speed, Pressure, etc. and come up with a reasonable estimate for Humidity.
We are going to be using Jupyter Notebook and the scikit-learn library to construct this model. Instructions on setting up your programming environment can be found here: Episode 4.3

1. Importing and Exploring our Data

Importing our data into python
import pandas as pd
# Store our data as a data-frame in the variable weatherData
weatherData = pd.read_csv(r"D:\ProjectData\weatherHistory.csv")
# Display the first few rows of our data in our notebook
weatherData.head()

Data Statistics
We can find useful statistics about our weather columns by using the following function:
weatherData.describe()

- From the count property we see that we have close to 100,000 rows of data; with this much data, k-fold cross-validation would take a lot of computing power
- The Cloud Cover column consists only of 0s and should not be included in our model
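The constant-column observation is easy to verify in code. A minimal sketch on a made-up miniature frame (the real check would run on weatherData, with column names matching the CSV headers):

```python
import pandas as pd

# Toy stand-in for weatherData, with one informative and one constant column
df = pd.DataFrame({
    "Humidity": [0.89, 0.86, 0.83],
    "Cloud Cover": [0.0, 0.0, 0.0],
})

# A column with a single unique value carries no information for the model
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # → ['Cloud Cover']
```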
2. Preprocessing our data

Importing our modules
Our first step is to import the functions for preprocessing our data.
We have imported two functions:
- train_test_split: To split our data into training and test data
- PolynomialFeatures: To enable us to apply polynomial regression
# Import train_test_split from scikit-learn to split our data into training and test data
from sklearn.model_selection import train_test_split
# Import PolynomialFeatures for polynomial regression
from sklearn.preprocessing import PolynomialFeatures
Defining our Model’s Features
We then define our model’s features: these are the variables we will be using to predict humidity.
weatherFeatures = ["Temperature (C)","Apparent Temperature (C)","Wind Speed (km/h)", "Wind Bearing (degrees)","Visibility (km)","Pressure (millibars)"]
We have excluded Cloud Cover since we discovered earlier that it contains only constant values of 0.
Next, we define our inputs X and output or target value y
X = weatherData[weatherFeatures]
y = weatherData.Humidity
Standardizing our Data
This is done to ensure that our data has roughly the same variance.
- Pressure has a significantly larger range of values than any other variable.
- Leaving our pressure values as they are may affect our model’s accuracy
- To prevent this, we scale all our data by subtracting each column’s mean and dividing by its standard deviation. This puts all our data on roughly the same scale, with a mean of 0, essentially putting our data on the “same playing field”. This is known as data standardization.
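As a small illustration of the standardization formula z = (x − mean) / std, here it is written out with NumPy on made-up numbers (not our weather columns):

```python
import numpy as np

# Toy column of pressure-like values on a large scale
x = np.array([1010.0, 1015.0, 1020.0, 1025.0])

# Standardization: subtract the column mean, divide by the column std
z = (x - x.mean()) / x.std()

print(z.mean())  # ~0 after scaling
print(z.std())   # ~1 after scaling
```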
from sklearn import preprocessing

X_scaled = preprocessing.scale(X)
Implementing Polynomial Features
Here we have set the polynomial degree to 1, which just gives us the standard multiple linear model. (We will change this later.)
poly = PolynomialFeatures(1)
X_final = poly.fit_transform(X_scaled)
As we are not making any more adjustments to our data, we call our input data X_final.
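To see what PolynomialFeatures actually produces, here is a small sketch on two made-up inputs: degree 1 keeps the original features (plus a bias column), while degree 2 adds squares and cross-terms:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_toy = np.array([[2.0, 3.0]])  # one sample, two features a and b

poly1 = PolynomialFeatures(1)
print(poly1.fit_transform(X_toy))  # [[1. 2. 3.]]           -> 1, a, b

poly2 = PolynomialFeatures(2)
print(poly2.fit_transform(X_toy))  # [[1. 2. 3. 4. 6. 9.]]  -> 1, a, b, a², ab, b²
```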
Splitting our Data
Lastly, we split our data into training and test data in order to cross-validate our model.
Here, because we have lots of data, we split it into 90% training data and 10% test data. We also randomly shuffle our data to prevent any ordering bias.
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.10, random_state=42)

3. Building our Regression Model

We construct our regression model on our training data using the Ridge function imported from scikit-learn.
- Here we are using ridge regression, which is linear regression with l2 regularization
- Regularization is applied here to prevent our model from overfitting our data. We have set our complexity parameter (regularization parameter) alpha to 0.5.
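For intuition, ridge regression minimizes ||y − Xw||² + alpha·||w||², which for centered data has the closed-form solution w = (XᵀX + alpha·I)⁻¹Xᵀy. A minimal NumPy sketch on made-up data with known coefficients (not our weather set; scikit-learn's Ridge additionally fits an unpenalized intercept):

```python
import numpy as np

# Made-up, roughly centered data generated from known true coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 0.5
# Closed-form ridge solution: w = (X^T X + alpha * I)^(-1) X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
print(w)  # close to the true coefficients [1, -2, 0.5]
```

The alpha·I term shrinks the coefficients toward zero; larger alpha means stronger regularization.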
from sklearn.linear_model import Ridge
regr = Ridge(alpha=0.5)
regr.fit(X_train, y_train)
Lastly we use our model to make predictions for all of our test data.
y_pred = regr.predict(X_test)
We can then compare our predicted humidity values (y_pred) with our actual humidity values (y_test) to evaluate and improve our model.
4. Evaluating and Improving our Model

To evaluate our model we will be using two evaluation metrics:
- Mean Squared Error: Discussed in previous episodes; gives the average squared distance between our predicted values and our actual values.
- Coefficient of determination (R²): Shows the extent to which our output (Humidity) can be predicted. Takes a value between 0 and 1.
– 0 means our model captures none of the relation between our input and output
– 1 means our model captures all of the relation between our input and output
- An R² score closer to 1 shows better model performance. Usually we look for R² > 0.7 for our model to be considered “good”.
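R² can also be computed by hand as 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the actual values. A small sketch on made-up humidity-like values, checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.70, 0.80, 0.90, 0.60])
y_hat = np.array([0.72, 0.78, 0.88, 0.64])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2, r2_score(y_true, y_hat))  # the two values agree
```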
from sklearn.metrics import mean_squared_error, r2_score
print('Mean squared error: %.3f' % mean_squared_error(y_test, y_pred))
print('Coefficient of determination: %.3f' % r2_score(y_test, y_pred))
Evaluating our simple linear model [poly = PolynomialFeatures(1)] gives the following:

Here we have a relatively low mean squared error, which is good; however, R² < 0.7, which is poor.
Let’s try adding more polynomial features to our model by increasing the polynomial degree to 2: [poly = PolynomialFeatures(2)]
- Our mean squared error has reduced and R² has increased significantly, giving us a much better regression model
We will now try our model with polynomial degrees of 3, 4, 5, … and see what effect this has on our R² score.

- As we include more and more features (polynomial combinations of our input variables), our R² slowly increases, showing better model performance on our test data.
- We find that a polynomial degree of 7 yields the best R² score of 0.676; including any more polynomial features results in our model overfitting our data.
- We could try increasing our regularization parameter alpha from 0.5 to 5 to allow more polynomial features while preventing overfitting; however, this would take a significant amount of computing power.
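The degree sweep described above can be sketched as a simple loop. This version runs on synthetic stand-in data so it executes quickly; on the real data, X and y would be replaced by X_scaled and our humidity target:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in: 6 features and a nonlinear target,
# so higher polynomial degrees should help
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)

scores = {}
for degree in (1, 2, 3):
    X_poly = PolynomialFeatures(degree).fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_poly, y, test_size=0.10, random_state=42)
    regr = Ridge(alpha=0.5).fit(X_tr, y_tr)
    scores[degree] = r2_score(y_te, regr.predict(X_te))
    print(degree, round(scores[degree], 3))
```

On this synthetic target, degree 2 captures the quadratic terms the linear model misses, so its R² jumps sharply, mirroring the behaviour we saw on the weather data.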

Our final model has a total of 1717 parameters (coefficients + intercept) and predicts Humidity with reasonable to good accuracy.
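The parameter count checks out: with 6 input features, the number of monomials of degree at most 7 is C(6 + 7, 7) = 1716 (including the bias column PolynomialFeatures generates), and adding the model's intercept gives 1717. A quick sketch:

```python
from math import comb

n_features, degree = 6, 7
# Number of monomials of degree <= 7 in 6 variables, bias term included
n_terms = comb(n_features + degree, degree)
print(n_terms)      # → 1716
print(n_terms + 1)  # → 1717, with the intercept
```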
5. Using our Model to make Predictions

Lastly, to make use of our model, we record a new weather observation while visiting Szeged, Hungary, and store it in the variable weatherObs.
We then use our model to make a prediction for what Humidity we could expect with our observed weather data.
- Ensure the same preprocessing steps are applied to our newly observed weather data. Note that calling preprocessing.scale directly on a single observation would scale it against its own mean and standard deviation (turning every value into 0), so instead we fit a StandardScaler on our original features and use it to transform the new observation.
- Ensure our observations are entered in the order of our features: Temperature, Apparent Temperature, Wind Speed, …
weatherObs = [[32, 31.4, 44, 344, 13, 1020.33]]
scaler = preprocessing.StandardScaler().fit(X)
weatherObs_scaled = scaler.transform(weatherObs)
weatherObs_final = poly.transform(weatherObs_scaled)
y_pred = regr.predict(weatherObs_final)
y_pred
This yields a predicted Humidity value of 0.789:

This Project concludes our work with Linear Regression.
