Will it Rain Tomorrow?

Code and data used can be found here: Repository 

In this episode we will be expanding on Logistic Regression in Python, implementing many more data pre-processing steps on a larger data set that contains both numerical and categorical (text) data.

Objective

Construct a logistic regression model to predict if it will rain tomorrow in a city in Australia.

1. Importing and Exploring our Data

Importing our data into Python

import pandas as pd
import numpy as np  # for math operations later

# a raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"D:\ProjectData\weatherAus.csv")
print('Size of weather data frame is :', df.shape)
df.head()

We can see here that we are working with over 100,000 rows of data.

We can see all the column names using the following code:

col_names = df.columns

col_names

We can get the general statistics of each column from the following:

df.describe()

2. Processing our data

Dropping Data

The dataset description states that RISK_MM should be dropped from the dataset since it can result in something called data leakage. This means that RISK_MM may contain information about our target variable, RainTomorrow, which we are not supposed to know in advance, and which may result in models that are “too good to be true”.

It is often a good idea to remove columns that contain large amounts of missing data, since these tend to be less useful for constructing our model.

The following code gives the number of non-missing values in each column:

df.count().sort_values()

We see here that Sunshine, Evaporation, Cloud3pm and Cloud9am each have less than 60% of their values present, so we will remove these columns from our dataset. We will also remove the Date column.

df = df.drop(columns=['Sunshine', 'Evaporation', 'Cloud3pm', 'Cloud9am', 'RISK_MM', 'Date'])
df.shape
df.shape

Here we have removed six columns of data.

Next, we will remove all rows containing NaN values, which can be done in one line with pandas' dropna:
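
df = df.dropna(how='any')  # drop every row that still contains a missing value
df.shape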

Here we have removed around 30,000 rows of data.

Since we initially have quite a large dataset, removing chunks of data should not cause too large of an issue for our model.

Removing Outliers

  • Large outliers can heavily damage the accuracy of logistic and linear regression models.

To remove outliers we can look at something called a Z-score.
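
For a value x in a column with mean μ and standard deviation σ, the Z-score is z = (x − μ) / σ, i.e. the number of standard deviations x lies from the column mean.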

Every column of data can be standardized in this way, and for roughly normal data the standardized values follow the standard normal distribution:

[Figure: standard deviation bands of the normal distribution (https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg)]

Around 99.7% of normally distributed data lies within 3 standard deviations of the column mean. Any value beyond 3 standard deviations can be considered an outlier and removed from the dataset.

In this case, if we find any row of data (observation) with an outlier we will remove the entire row from the dataset.

We can do this with the following code:

from scipy import stats

# compute the absolute Z-score of every value in the numeric columns
z = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
print(z)

# keep only the rows where every numeric value is within 3 standard deviations
df = df[(z < 3).all(axis=1)]
print(df.shape)

Standardizing our data

Next, we standardize our data, since logistic and linear regression work best when all the data is on the same scale. This is because gradient descent with regularization assumes that all our features are centered around 0 (mean = 0) and have the same order of variance.

  • We only standardize our numerical data; for this we create a list called numerical that stores all columns that contain numerical data.
  • We then go through each numerical column with a for loop and standardize it.

from sklearn import preprocessing

numerical = [var for var in df.columns if df[var].dtype == 'float64']

# rescale each numerical column to mean 0 and variance 1
for col in numerical:
    df[col] = preprocessing.scale(df[col])

df.head()
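
As a quick check, each standardized column should now have a mean of (approximately) 0 and a standard deviation of 1:

# means should be ~0 and standard deviations ~1 after scaling
print(df[numerical].describe().loc[['mean', 'std']])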

Dealing with Categorical Data

As in the previous episode, we change all yes and no data to 1’s and 0’s so it is readable by our model.

df['RainToday'].replace({'No': 0, 'Yes': 1}, inplace=True)
df['RainTomorrow'].replace({'No': 0, 'Yes': 1}, inplace=True)
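
We can confirm the mapping worked by counting the values in each column:

# both columns should now contain only 0s and 1s
print(df['RainToday'].value_counts())
print(df['RainTomorrow'].value_counts())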

One Hot encoding

One hot encoding is the method we use to convert categorical data into numerical data.

Each categorical variable is split into one column per category, and we use 1's and 0's to mark which category is present in each row. For example, a Colour column containing the values Red, Green and Blue becomes three indicator columns, one per colour.
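
Here is a toy illustration (the Colour column below is made up for demonstration and is not part of our dataset):

colours = pd.DataFrame({'Colour': ['Red', 'Green', 'Blue', 'Green']})
print(pd.get_dummies(colours, columns=['Colour']))
# produces one indicator column per colour:
# Colour_Blue, Colour_Green, Colour_Red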

We will now apply this method to all of the categorical data in our dataset.

To view the categorical data in our dataset we can use the following code:

  • Categorical variables have the data type (dtype) object
  • Numerical variables have the data type (dtype) float64

categorical = [var for var in df.columns if df[var].dtype=='object']

print("Number of categorical variables: ", len(categorical))

print(categorical)

We have already dealt with the RainToday and RainTomorrow columns. We now apply one-hot encoding to the remaining categorical columns.

To implement one hot encoding we can use the function pd.get_dummies.

We can apply this function to all remaining categorical variables; a minimal version of this step looks like the following:
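
# display the unique values in each remaining categorical column
for col in categorical:
    print(col, ':', df[col].unique())

# one-hot encode all remaining categorical columns
df = pd.get_dummies(df, columns=categorical)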

Here we have also displayed all unique values in each categorical column.

We can check that our one hot encoding has worked by printing out a few of the newly created columns. By default, get_dummies prefixes each new column with the name of the original column:
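
# e.g. WindGustDir (one of the categorical columns in this dataset)
# becomes WindGustDir_E, WindGustDir_ENE, ... after encoding
print(df[[col for col in df.columns if col.startswith('WindGustDir_')]].head())

print(df.shape)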

From df.shape we see we now have a total of 106 columns in our dataset.

Model Features

  • In this case, we will be using all 106 features to predict if it will rain tomorrow:

# set X to all features
X = df.loc[:, df.columns != 'RainTomorrow']

# set y to our target RainTomorrow
y = df.RainTomorrow

  • Even if some of our features have no influence on whether it will rain tomorrow, we can still include them in our model.
  • This is because, when calculating the model's parameters, the model can identify features that have less influence on the final output (RainTomorrow) and assign very small parameter values to them, reducing their influence on the final output.

Splitting our data

After manipulating all of our data it is now time to split our data into a training and test set.

We split our data into training and test sets so that we can evaluate our model's performance on observations it has never seen before (a more thorough k-fold cross-validation follows in section 4).

  • In this case, we will be using 25% of our data for testing:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

3. Building our Logistic Regression Model

To build our model we import LogisticRegression from the scikit-learn library and fit it to our training data.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train,y_train)
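
Earlier, in the Model Features section, we claimed that features with little influence receive very small parameter values. We can check this by pairing each fitted coefficient with its feature name (a quick sketch using the fitted model above):

# near-zero coefficients mark features with little influence on the output
coef = pd.Series(logreg.coef_[0], index=X_train.columns)
print(coef.abs().sort_values().head(10))   # least influential features
print(coef.abs().sort_values().tail(10))   # most influential features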

4. Evaluating our Model

In project 1 we used the train_test_split method to evaluate our model. In this case, we are going to use k-fold cross-validation, which is a more reliable way of evaluating our model since it uses all of our data to both build and test the model.

We will be splitting our data into 5 folds.

For each fold we produce an evaluation metric; in this case, we will be using accuracy to evaluate our model.

Accuracy is essentially the percentage of correct predictions and is given by the following formula:

Accuracy = (number of correct predictions) / (total number of predictions)

Because we are using 5 folds we will be producing 5 accuracy scores. We take the mean of these 5 scores to evaluate our model.

# import relevant functions
from numpy import mean
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# split data into 5 folds and shuffle to avoid bias
cv = KFold(n_splits=5, random_state=1, shuffle=True)

scores = cross_val_score(logreg, X, y, scoring='accuracy', cv=cv)
average_score = mean(scores)

print('Overall Accuracy:', average_score)

Now that we have established that our model has a good accuracy of 0.85657.., meaning it gets roughly 86% of predictions correct, we can use it to make predictions.


5. Making predictions

In this episode, we will just be making predictions on our test data.

If we wanted to introduce new data to our model, we would have to supply a value for each of the 106 features the final model uses to predict if it will rain tomorrow, which would take quite some time.

We will just make predictions for 50 observations from our test data.

logreg.predict(X_test[0:50])
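
To get a feel for how good these predictions are, we can compare them with the actual outcomes for the same rows (a quick check; 1 means rain tomorrow, 0 means no rain):

print(logreg.predict(X_test[0:50]))
print(y_test[0:50].values)  # the true labels for the same 50 observations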

And there you have it. Thanks for reading.

If you have any questions please leave them below!
