Final Project: Classification with Python
Estimated Time Needed: 180 min
Instructions
In this notebook, you will practice the classification algorithms that we have learned in this course.
Below, we will use these algorithms to build models from our training data and evaluate them on our testing data using the evaluation metrics learned in the course.
We will use some of the algorithms taught in the course, specifically:
- Linear Regression
- KNN
- Decision Trees
- Logistic Regression
- SVM
We will evaluate our models using:
- Accuracy Score
- Jaccard Index
- F1-Score
- LogLoss
- Mean Absolute Error
- Mean Squared Error
- R2-Score
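As a quick reminder of how these metrics are called, here is a minimal sketch on hypothetical toy arrays (the values of y_true, y_pred and y_prob are made up for illustration and are not taken from the weather dataset). It assumes a scikit-learn version that provides jaccard_score.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, jaccard_score, f1_score, log_loss,
                             mean_absolute_error, mean_squared_error, r2_score)

# Hypothetical toy labels and predictions (illustration only)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
# Hypothetical predicted probability of class 1 for each row
y_prob = np.array([0.1, 0.9, 0.4, 0.2, 0.8])

# Classification metrics compare true labels with predicted labels
toy_accuracy = accuracy_score(y_true, y_pred)
toy_jaccard = jaccard_score(y_true, y_pred)
toy_f1 = f1_score(y_true, y_pred)

# LogLoss needs predicted probabilities, not hard labels
toy_logloss = log_loss(y_true, y_prob)

# Regression-style metrics measure the numeric error of the predictions
toy_mae = mean_absolute_error(y_true, y_pred)
toy_mse = mean_squared_error(y_true, y_pred)
toy_r2 = r2_score(y_true, y_pred)

print(toy_accuracy, toy_jaccard, toy_f1, toy_logloss, toy_mae, toy_mse, toy_r2)
```

Note that LogLoss is the only metric here that takes probabilities as input; all the others take the predicted values directly.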
Finally, you will use your models to generate the report at the end.
About The Dataset
The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from http://www.bom.gov.au/climate/dwo/.
The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was gathered from Rattle at https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData
This dataset contains observations of weather metrics for each day from 2008 to 2017. The weatherAUS.csv dataset includes the following fields:
| Field | Description | Unit | Type |
|---|---|---|---|
| Date | Date of the Observation in YYYY-MM-DD | Date | object |
| Location | Location of the Observation | Location | object |
| MinTemp | Minimum temperature | Celsius | float |
| MaxTemp | Maximum temperature | Celsius | float |
| Rainfall | Amount of rainfall | Millimeters | float |
| Evaporation | Amount of evaporation | Millimeters | float |
| Sunshine | Amount of bright sunshine | hours | float |
| WindGustDir | Direction of the strongest gust | Compass Points | object |
| WindGustSpeed | Speed of the strongest gust | Kilometers/Hour | object |
| WindDir9am | Wind direction averaged over the 10 minutes prior to 9am | Compass Points | object |
| WindDir3pm | Wind direction averaged over the 10 minutes prior to 3pm | Compass Points | object |
| WindSpeed9am | Wind speed averaged over the 10 minutes prior to 9am | Kilometers/Hour | float |
| WindSpeed3pm | Wind speed averaged over the 10 minutes prior to 3pm | Kilometers/Hour | float |
| Humidity9am | Humidity at 9am | Percent | float |
| Humidity3pm | Humidity at 3pm | Percent | float |
| Pressure9am | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal | float |
| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal | float |
| Cloud9am | Fraction of the sky obscured by cloud at 9am | Eighths | float |
| Cloud3pm | Fraction of the sky obscured by cloud at 3pm | Eighths | float |
| Temp9am | Temperature at 9am | Celsius | float |
| Temp3pm | Temperature at 3pm | Celsius | float |
| RainToday | If there was rain today | Yes/No | object |
| RainTomorrow | If there is rain tomorrow | Yes/No | float |
Column definitions were gathered from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml
Import the required libraries
# All libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented out.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics
Importing the Dataset
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'
await download(path, "Weather_Data.csv")
filename = "Weather_Data.csv"
df = pd.read_csv(filename)
Note: This version of the lab is designed for JupyterLite, which requires downloading the dataset to the interface. However, when working with the downloaded version of this notebook on your local machine (Jupyter Anaconda), you can skip the "Importing the Dataset" steps above and use the URL directly in the pandas.read_csv() function. You can uncomment and run the statements in the cell below.
#filepath = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv"
#df = pd.read_csv(filepath)
df.head()
Data Preprocessing
One Hot Encoding
First, we need to perform one hot encoding to convert categorical variables to binary variables.
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the get_dummies method here because we would end up with two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.
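To make the difference concrete, here is a minimal sketch on a hypothetical three-row frame (the name toy and its values are made up for illustration and are not from Weather_Data.csv). The feature columns are one hot encoded with get_dummies, while the target is mapped to a single 0/1 column. The sketch uses Series.map on the target column only, which is an alternative to the frame-wide replace used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame (illustration only)
toy = pd.DataFrame({
    "WindGustDir": ["N", "S", "N"],
    "RainToday": ["No", "Yes", "No"],
    "RainTomorrow": ["Yes", "No", "No"],
})

# One hot encode the categorical feature columns: one new column per category
toy_encoded = pd.get_dummies(data=toy, columns=["WindGustDir", "RainToday"])

# Map the target to a single binary column instead of two dummy columns
toy_encoded["RainTomorrow"] = toy_encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

print(toy_encoded.columns.tolist())
print(toy_encoded["RainTomorrow"].tolist())
```

Each categorical feature becomes several indicator columns, whereas the target stays a single column that a classifier can use directly as its label.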
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
Training Data and Test Data
Now, we set our features (the x values) and our target variable Y.
df_sydney_processed.drop('Date', axis=1, inplace=True)
df_sydney_processed = df_sydney_processed.astype(float)
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']
Linear Regression
Q1) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 10.
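As a reminder of how train_test_split behaves, here is a minimal sketch on hypothetical synthetic data (toy_X and toy_y are made-up arrays, not the features and Y frames of this lab). With 10 rows and a test_size of 0.2, two rows are held out for testing, and a fixed random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 10 rows with 2 feature columns (illustration only)
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=10
)

print(toy_X_train.shape, toy_X_test.shape)
```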
#Enter Your Code and Execute
x_train, x_test, y_train, y_test =
Q2) Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train).
#Enter Your Code and Execute
LinearReg =
Q3) Now use the predict method on the testing data (x_test) and save it to the array predictions.
#Enter Your Code and Execute
predictions =
Q4) Using the predictions and the y_test dataframe, calculate the value for each metric using the appropriate function.
#Enter Your Code and Execute
LinearRegression_MAE =
LinearRegression_MSE =
LinearRegression_R2 =
Q5) Show the MAE, MSE, and R2 in a tabular format using a data frame for the linear model.
#Enter Your Code and Execute
Report =
KNN
Q6) Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.
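The general fit-then-predict pattern for a KNN classifier can be sketched on hypothetical toy data as follows (toy_X, toy_y and toy_knn are made-up names, and the two clusters are chosen only to make the behaviour easy to follow). With n_neighbors=4, each prediction is a majority vote among the four closest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: two well separated clusters (illustration only)
toy_X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]])
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit a classifier that votes among the 4 nearest training points
toy_knn = KNeighborsClassifier(n_neighbors=4)
toy_knn.fit(toy_X, toy_y)

# One query point near each cluster
toy_pred = toy_knn.predict(np.array([[0.15], [5.15]]))
print(toy_pred)
```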
#Enter Your Code and Execute
KNN =
Q7) Now use the predict method on the testing data (x_test) and save it to the array predictions.
#Enter Your Code and Execute
predictions =
Q8) Using the predictions and the y_test dataframe, calculate the value for each metric using the appropriate function.
#Enter Your Code and Execute
KNN_Accuracy_Score =
KNN_JaccardIndex =
KNN_F1_Score =
Decision Tree
Q9) Create and train a Decision Tree model called Tree using the training data (x_train, y_train).
#Enter Your Code and Execute
Tree =
Q10) Now use the predict method on the testing data (x_test) and save it to the array predictions.
#Enter Your Code and Execute
predictions =
Q11) Using the predictions and the y_test dataframe, calculate the value for each metric using the appropriate function.
#Enter Your Code and Execute
Tree_Accuracy_Score =
Tree_JaccardIndex =
Tree_F1_Score =
Logistic Regression
Q12) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 1.
#Enter Your Code and Execute
x_train, x_test, y_train, y_test =
Q13) Create and train a LogisticRegression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear.
#Enter Your Code and Execute
LR =
Q14) Now, use the predict and predict_proba methods on the testing data (x_test) and save the results as two arrays, predictions and predict_proba.
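The difference between predict and predict_proba can be sketched on hypothetical toy data (toy_X, toy_y and toy_lr are made-up names used only for illustration). predict returns one hard label per row, while predict_proba returns one probability per class per row, which is the input that log_loss expects.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Hypothetical 1-D toy data (illustration only)
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_lr = LogisticRegression(solver="liblinear")
toy_lr.fit(toy_X, toy_y)

# Hard class labels: one value per row
toy_labels = toy_lr.predict(toy_X)

# Class probabilities: one column per class, each row sums to 1
toy_proba = toy_lr.predict_proba(toy_X)

# LogLoss is computed from the probabilities, not from the hard labels
toy_loss = log_loss(toy_y, toy_proba)

print(toy_labels.shape, toy_proba.shape, toy_loss)
```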
#Enter Your Code and Execute
predictions =
predict_proba =
Q15) Using the predictions, predict_proba and the y_test dataframe, calculate the value for each metric using the appropriate function.
#Enter Your Code and Execute
LR_Accuracy_Score =
LR_JaccardIndex =
LR_F1_Score =
LR_Log_Loss =
SVM
Q16) Create and train an SVM model called SVM using the training data (x_train, y_train).
#Enter Your Code and Execute
SVM =
Q17) Now use the predict method on the testing data (x_test) and save it to the array predictions.
#Enter Your Code and Execute
predictions =
Q18) Using the predictions and the y_test dataframe, calculate the value for each metric using the appropriate function.
#Enter Your Code and Execute
SVM_Accuracy_Score =
SVM_JaccardIndex =
SVM_F1_Score =
Report
Q19) Show the Accuracy, Jaccard Index, F1-Score and LogLoss in a tabular format using a data frame for all of the above models.
*LogLoss applies only to the Logistic Regression model.
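One possible way to lay out such a table is sketched below. The model names and numbers are placeholders invented for illustration and are not results from this lab. A metric that does not apply to a model is stored as NaN so that every row has the same columns.

```python
import numpy as np
import pandas as pd

# Placeholder values for illustration only; these are not results from this lab
toy_metrics = {
    "Model A": {"Accuracy": 0.80, "Jaccard Index": 0.50, "F1-Score": 0.67, "LogLoss": np.nan},
    "Model B": {"Accuracy": 0.82, "Jaccard Index": 0.52, "F1-Score": 0.68, "LogLoss": 0.40},
}

# One row per model, one column per metric
toy_report = pd.DataFrame.from_dict(toy_metrics, orient="index")
print(toy_report)
```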
Report =
How to submit
Once you complete your notebook, you will have to share it. You can download the notebook by navigating to "File" and clicking the "Download" button.
This will save the (.ipynb) file on your computer. Once saved, you can upload this file in the "My Submission" tab of the "Peer-graded Assignment" section.
About the Authors:
Joseph Santarcangelo has a PhD in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
