Logistic Regression in Python

You can view and use the code and data used in this episode here: Link

Consider reading Episode 7.1 before continuing, which explains how logistic regression works.

Instructions for setting up your programming environment can be found at the start of Episode 4.3.

Objective

Predict whether it will rain tomorrow in Albury, Australia, given daily weather observations such as temperature, rainfall, humidity and pressure.

Importing our Data

  • We store our data in the variable df, short for "data frame".
  • df.shape gives the number of rows and columns in our data.
  • df.head() displays the first few rows of the data in our notebook.
# Read the data
import pandas as pd
# Use a raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"D:\ProjectData\weatherAlbury.csv")
print('Size of weather data frame is:', df.shape)
df.head()

We see that our weather data has 3011 rows and 13 columns.

Pre-processing Our Data

Removing Missing Entries and Converting to Binary

In this episode, the only pre-processing we will do is remove any row which contains an NA (Not Available) value. This is commonly done when we want to apply a model to our data quickly.

Since we are removing whole rows, we may be losing valuable data. For example, a row with a single missing value is removed even though we still have its data for MinTemp, MaxTemp, Humidity and more.

Often we replace NA with the mean or mode of that column — we will discuss this in a future episode.
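As a quick sketch of that alternative, here is how NA values could be filled with a column's mean (for numeric data) or mode (for yes/no data) in pandas. The toy data frame below is made up for illustration and is not the Albury data:

```python
import pandas as pd

# Toy frame standing in for the weather data (values are made up)
df = pd.DataFrame({
    "MinTemp": [8.0, None, 12.5, 10.0],
    "RainToday": ["No", "Yes", None, "No"],
})

# Numeric column: replace NA with the column mean
df["MinTemp"] = df["MinTemp"].fillna(df["MinTemp"].mean())

# Categorical column: replace NA with the most frequent value (the mode)
df["RainToday"] = df["RainToday"].fillna(df["RainToday"].mode()[0])

print(df)
```

This keeps all four rows instead of dropping the two that contain an NA.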

We also replace all yes and no values with the binary numbers 1 and 0, since models work with numbers, not words.

# Preprocess the data
df = df.dropna()
print("new shape:", df.shape)
# Replace Yes and No with 1 and 0 (assigning back avoids pandas'
# warnings about modifying a column in place)
df['RainToday'] = df['RainToday'].replace({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].replace({'No': 0, 'Yes': 1})

Previously we had 3011 rows and now we have 2981, so we have removed 30 rows of data.
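As a sanity check on that arithmetic, a small made-up example shows that the number of rows dropna() removes equals the number of rows containing at least one NA:

```python
import pandas as pd
import numpy as np

# Toy frame: two of the four rows contain an NA somewhere
df = pd.DataFrame({
    "MinTemp": [8.0, np.nan, 12.5, 10.0],
    "MaxTemp": [19.0, 21.0, np.nan, 24.0],
})

rows_with_na = df.isna().any(axis=1).sum()  # rows dropna() will remove
cleaned = df.dropna()

print(len(df) - len(cleaned))  # matches rows_with_na
```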

Defining our Model’s Features

We define our model's features using the following code. In this case, we are using all of our features to predict whether it will rain tomorrow in Albury.

X = df[["MinTemp", "MaxTemp", "Rainfall", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "Temp9am", "Temp3pm", "RainToday"]]
y = df.RainTomorrow

Splitting Our Data

We split our data into training and test sets so we can evaluate our model on data it has not seen.

Here we will use 80% of the data for training and 20% for testing. To split our data we import train_test_split from the scikit-learn library. The data is randomly shuffled before splitting to prevent any ordering bias. Setting random_state=42 fixes the shuffle, so we produce consistent, reproducible results that are easier to evaluate.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
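To see why a fixed random_state gives consistent results, this small sketch (using made-up toy data, not the weather set) splits the same arrays twice and gets identical test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.arange(10)

# Same random_state -> same shuffle, so the two splits match exactly
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.20, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.20, random_state=42)

print(np.array_equal(X_te1, X_te2))  # the test sets are identical
print(len(X_tr1), len(X_te1))        # 80% / 20% of 10 samples: 8 and 2
```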

Building our Logistic Regression Model

We import the logistic regression class from the scikit-learn library and apply it to our data.

We use y_pred to get a set of predicted values from our test data, to evaluate our model.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

Putting our code together we get:

# Implement Logistic Regression Model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df[["MinTemp", "MaxTemp", "Rainfall", "Humidity9am", "Humidity3pm", "Pressure9am",
        "Pressure3pm", "Temp9am", "Temp3pm", "RainToday"]]
y = df.RainTomorrow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

Evaluating Our Model

To evaluate our model we will use an evaluation metric called the accuracy score. This metric was explained at the end of the previous episode and is given by the following formula:

accuracy = (true positives + true negatives) / total

  • True negative is the number of times our model correctly predicted no
  • True positive is the number of times our model correctly predicted yes
  • Total is the total number of predictions (correct + incorrect predictions)
# Evaluate the model
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,y_pred)
print('Accuracy :',score)
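To see where that number comes from, here is a sketch computing the accuracy by hand from the confusion-matrix counts and checking it against accuracy_score. The labels below are made up for illustration; they are not our actual test data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels standing in for y_test and y_pred (values are made up)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_hat  = np.array([0, 1, 1, 1, 0, 0, 0, 1])

# For binary labels, confusion_matrix().ravel() gives tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
manual = (tp + tn) / (tp + tn + fp + fn)

print(manual, accuracy_score(y_true, y_hat))  # both give 0.75
```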

Here our model achieves an accuracy score of around 0.866, which is quite good. We can now use our model to make predictions for new observations.

Making Predictions

Let’s say we recorded 3 new weather observations in Albury and want to use the model we built to predict whether it will rain tomorrow.

We have saved these observations as a CSV file named weatherObs.csv.

To make predictions for these weather observations we must apply the same pre-processing steps as before. Here there are no NA values, so the only step required is converting yes and no to binary.

# Making predictions
# Use a raw string so the backslashes in the Windows path are not treated as escapes
Obs = pd.read_csv(r"D:\ProjectData\Observations\weatherObs.csv")
Obs['RainToday'] = Obs['RainToday'].replace({'No': 0, 'Yes': 1})
# Extract just the data values (ignore column headings)
newData = Obs.values
y_pred = logreg.predict(newData)
y_pred
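Since the model outputs 0s and 1s, a small sketch like the following can map the predictions back to readable labels. The prediction array here is made up for illustration, matching the outcome described below:

```python
import numpy as np

# Suppose the model returned these binary predictions for the 3 observations
y_pred = np.array([1, 0, 1])

# Map 1 -> "Yes" and 0 -> "No" for readability
labels = np.where(y_pred == 1, "Yes", "No")
print(labels)  # ['Yes' 'No' 'Yes']
```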

Our model predicts yes for the first and third observations and no for the second.


This was a fairly quick implementation of logistic regression.

To make the best use of our data, we can apply more pre-processing steps and implement functions that tell us which features to use in our model. We will do this in the next episode, project 2.

Thanks for reading.

If you have any questions please leave them below in the comments!
