ML-Based Rain Prediction
2023-11-27 | By Maker.io Staff
License: See Original Project
This article explains how to build a machine-learning model for predicting precipitation. It discusses obtaining training data, exploring the samples, finding possible correlations in the data, preprocessing the samples to facilitate their use in training, and finally, choosing and building a model.
Source: https://pixabay.com/illustrations/artificial-neural-network-ann-3501528/
Prerequisites
Before getting started, make sure to install Python 3 and JupyterLab, as well as the following packages:
pip install jupyterlab pandas numpy seaborn scikit-learn Flask
Further, the article assumes you’re familiar with commonly used ML terminology and have experience with Jupyter and interactive notebooks. The project uses this data set from Kaggle. As it contains weather data collected in Seattle, it might not fit your geographic location, which can lead to poor model performance when making predictions for other regions. However, you can apply the techniques in this article to other data to improve the prediction quality for your area.
Loading and Exploring the Data
Start by including the following modules in your Jupyter Notebook:
import pandas as pd
import seaborn as sns
import numpy as np
import pickle
from sklearn import metrics
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
With the Pandas library, loading data from an external CSV file is as simple as calling a single function:
df = pd.read_csv("~/documents/seattle-weather.csv")
You can then output the df object to list a few samples and investigate their structure:
This image shows an excerpt of the original Seattle weather data set.
The data contains six attributes and 1461 samples. The weather column holds the value to predict in the final model and is a nominal label; I also output all possible values it can contain, as shown below. We’ll later have to convert this nominal label to a numeric value, but for now, note that reducing the dimensionality by simplifying the date column and dropping the precipitation column speeds up training and classification.
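The article doesn’t include the exact call, but Pandas’ unique() method lists every value the weather column can take:
print(df['weather'].unique())
For this data set, the possible values are drizzle, rain, sun, snow, and fog.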
I then wanted to get a feeling for the minimum and maximum values of all attributes to determine whether any of the samples contained outliers that could interfere with the training process:
print("Min Temp: " + str(min(df['temp_min'])))
print("Max Temp: " + str(max(df['temp_max'])))
print("Min Precipitation: " + str(min(df['precipitation'])))
print("Max Precipitation: " + str(max(df['precipitation'])))
print("Min Wind: " + str(min(df['wind'])))
print("Max Wind: " + str(max(df['wind'])))
Running the snippet produces the following output:
Min Temp: -7.1
Max Temp: 35.6
Min Precipitation: 0.0
Max Precipitation: 55.9
Min Wind: 0.4
Max Wind: 9.5
It seems the spread is not excessive, and normalization or outlier removal is not required. Finally, I also wanted to check whether any missing values need imputation or removal.
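The check itself isn’t shown in the article; a common way to count missing values per column in Pandas is:
print(df.isna().sum())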
As the output shows, the data doesn’t contain any samples with missing values.
Data Preprocessing
Start by modifying the date to reduce the data’s dimensionality. I decided to keep the month, as it makes sense to include it in predictions due to a higher likelihood of rain during certain seasons. The following code converts all date strings to date objects and then extracts the month as a numeric value from all samples. The second line then renames the date column to month:
df['date'] = pd.to_datetime(df['date'], errors='coerce').dt.month
df = df.rename(columns={"date": "month"})
Next, the nominal weather data is transformed into a numeric label. As the model should perform binary classification, the label to predict must either be one (rain) or zero (no rain):
df = df.replace(regex="rain|drizzle|snow", value=1)
df = df.replace(regex="sun|fog", value=0)
Replacing the weather label exposes a problem. Some samples have a precipitation value of zero, but the output label is one. The mismatch indicates that one of the two values needs to be corrected. Likely, drizzle was not considered precipitation, or some data samples are simply inaccurate. As this problem only affects around 6.5% of the samples, I removed them from the data set. The remaining examples still suffice to perform adequate model training:
df = df.drop(df[(df['precipitation'] == 0) & (df['weather'] == 1)].index)
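As a sanity check, the share of affected samples can be computed before executing the drop; this snippet is a sketch and not part of the original article:
# Run before the drop above: rows labeled as rain despite zero precipitation
mismatch = (df['precipitation'] == 0) & (df['weather'] == 1)
print(round(mismatch.mean() * 100, 1), "% of samples affected")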
Finally, I renamed the weather column to rain and removed the precipitation column:
df = df.rename(columns={"weather": "rain"})
df = df.drop(['precipitation'], axis=1)
Data Visualization
Although visualizing the data is optional, it can help you understand the relationships between attributes and how they might affect the model’s decision-making process. First, I wanted to inspect whether the data reflects my theory that some months have more rain than others.
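The plotting code isn’t shown in the article; a seaborn scatter plot along these lines produces a similar figure, assuming the column names defined above:
sns.scatterplot(data=df, x='month', y='temp_min', hue='rain')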
Correlating the month and minimum temperature of a day seems to reveal a statistical relationship between the attributes and the target label.
The orange dots indicate rain, and the blue ones represent samples without rain. The plot shows that some months, in combination with certain minimum temperature values, exhibit a higher percentage of positive samples (rain) than negative ones. For example, a minimum temperature on a November day above 2.5 degrees Celsius indicates an increased likelihood of rain that day. We can observe a similar relationship between the minimum and maximum temperatures. The following plot shows an almost perfect linear separation of the data:
Plotting the minimum and maximum temperatures on a 2D plane shows that the label seems linearly separable.
Finally, the wind and maximum temperature of the day also seemingly possess a noteworthy relationship, and they form two observable clusters:
Increasing wind and decreasing maximum temperature increase the likelihood of rain on any given day in the data.
The almost perfect linear separation based on the temperatures indicates that a decision tree might be a good choice, as a temperature-based split would almost perfectly divide the samples into the two classes. Similarly, the clustering shown in the final plot indicates that distance-based methods, such as kNN, could also be a reasonable choice, especially since the data doesn’t contain any outliers or excessive differences in value scaling.
Model Training
Based on the above plots, I decided to use a decision tree as the classifier. Decision trees are easy to understand, and their decisions can be visualized by following the nodes along the decision path. Note that I shortened this section dramatically to keep this project concise; the downloadable notebook contains more details and comparisons of different classifiers and their performance, and a brief comparison sketch follows after the split below.
Start by separating the attributes from the label and splitting the complete data set into training and test subsets.
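The article’s excerpt doesn’t show how X and y are derived; given the preprocessing above, a plausible definition separates the rain label from the remaining attributes:
X = df.drop(['rain'], axis=1)  # attributes: month, temp_max, temp_min, wind
y = df['rain']                 # binary label: 1 = rain, 0 = no rain
With X and y in place, the split itself is a single call: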
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
The X_train and X_test sets contain only the attribute values without the label, and the training set holds 70% of all samples. The y_train and y_test sets contain only the label to predict. Since we perform supervised learning, y_train is used during training, while y_test is only used to measure the model’s performance when predicting the samples in X_test.
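As a brief taste of the classifier comparison in the notebook, a cross-validated baseline using the models imported above could look like this; the snippet is a sketch rather than the article’s code, and kNN would typically also benefit from feature scaling with the imported StandardScaler:
# Hypothetical baseline: mean cross-validated accuracy per classifier
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, clf in [("Decision tree", DecisionTreeClassifier()),
                  ("kNN", KNeighborsClassifier()),
                  ("Random forest", RandomForestClassifier())]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    print(name, round(scores.mean(), 3))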
Next, train a decision tree with default parameters using the following commands:
t = DecisionTreeClassifier()
t = t.fit(X_train.values, y_train.values)
Then, have the model predict all samples in X_test and compare the results to the correct labels stored in y_test:
y_pred_t = t.predict(X_test.values)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_t))
The initial tree reaches an accuracy of around 70%, which is reasonable, and the result doesn’t indicate any significant underfitting or overfitting issues. However, visualizing the tree reveals that it is overly complex and can likely be optimized:
This image shows the initial decision tree’s structure, which is overly complex.
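The visualization code isn’t part of the article’s excerpt; scikit-learn’s tree module (imported above) can render the structure, assuming matplotlib is installed:
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
tree.plot_tree(t, feature_names=list(X.columns), class_names=['no rain', 'rain'], filled=True)
plt.show()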
Optimizing the Initial Classifier
The initial decision tree’s predictions are accurate in around 70% of cases, which is good but not ideal. Aside from that, the overly complex structure also offers room for improvement. A technique called hyperparameter tuning can help achieve both goals: a search algorithm tweaks specific parameters of the tree model, retrains it using the same data, and compares how the parameter changes affect the prediction accuracy. I also replaced the holdout method with cross-validation to tackle overfitting.
You could quickly implement a naive exhaustive search that tries all possible parameter combinations. However, scikit-learn’s built-in search utilities restrict the search to a predefined grid of sensible values and handle cross-validation and scoring for you, which keeps the runtime manageable:
# Create a decision tree classifier
gst = DecisionTreeClassifier()
# Define the hyperparameters to search over
params = {'max_depth': [2, 4, 8, 12], 'min_samples_split': [2, 4, 8], 'min_samples_leaf': [1, 2, 4]}
# Use cross-validation to tune the hyperparameters
search = GridSearchCV(gst, params, cv=5, scoring='accuracy')
search.fit(X, y)
You define the parameters and the values the search should try and then pass them to the search object, which, in this case, performs a grid search with five-fold cross-validation and rates the performance of each candidate tree by its accuracy. Note that you must pass the entire data set and all labels when fitting the tree in this manner, as cross-validation performs the splitting internally.
Once the search concludes, you can retrieve the best parameters and the achieved accuracy from the search object:
# Print the best hyperparameters and score
print("Best hyperparameters:", search.best_params_)
print("Best score:", search.best_score_)
The search reveals that the best tree reaches an accuracy of 75%, and it found the optimal parameters to be max_depth=8, min_samples_leaf=4, and min_samples_split=8.
As before, I decided to keep this section concise and excluded hyperparameter tuning for the other models. However, you can find the code in the notebook linked below.
Building the Optimal Decision Tree
Based on the search result, the following lines build and train the optimal decision tree for the Seattle weather data:
t8 = DecisionTreeClassifier(max_depth=8, min_samples_leaf=4, min_samples_split=8)
t8 = t8.fit(X_train.values, y_train.values)
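The article doesn’t repeat the evaluation step, but checking the tuned tree against the holdout set works exactly as before:
y_pred_t8 = t8.predict(X_test.values)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_t8))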
Saving the Machine Learning Model
The next part of this series will expose the trained ML model to users via an online API. The trained model must be exported from the Jupyter Notebook to be loaded into a standard Python application later. Pickle is a tool commonly used for exporting and importing trained ML models, and saving the model is as easy as calling a single function:
pickle.dump(t8, open('model.pkl', 'wb'))
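Loading the model back in a regular Python script is just as simple. A quick round-trip check might look like this; the sample values are hypothetical and follow the attribute order month, temp_max, temp_min, wind:
model = pickle.load(open('model.pkl', 'rb'))
print(model.predict([[11, 8.0, 2.5, 4.0]]))  # hypothetical November day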
Download the Jupyter Notebook
Use this link to view the notebook with more comprehensive explanations and experiments.
Conclusion
Building a machine learning model that can successfully make predictions is less complicated than it may initially seem. The process starts with gathering data for model training. Ideally, the data should reflect the environment and situation in which the finished model will be deployed. Using data from different contexts or geographic regions may lead to poor model performance once deployed.
The next step involves investigating the data and finding potential problems. Note that the data set used in this example is nearly perfect and only contains a few inconsistencies that were easy to remove. Real-world data is usually far more incomplete and noisy, and more thorough preprocessing is typically required.
Visualizing the data helps you to understand the relationships between the attributes, which, in turn, assists in selecting a fitting model. Further, visualizations can help identify outliers and patterns in the data. In this particular example, I decided to use a decision tree due to the low dimensionality of the data and the model’s simplicity, performance, and explainability. However, other algorithms, such as kNN or even neural networks, could’ve also been viable options.
The last step involves training, evaluating, and improving the model. Many approaches exist, but hyperparameter tuning with cross-validation is commonly used. Together, these two methods make it possible to find the optimal settings for an algorithm while tackling overfitting. However, searching for suitable parameters, especially exhaustively, can be computationally costly, particularly for algorithms with many parameters, such as neural networks.
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.