ML-Based Rain Prediction
2023-11-27 | By Maker.io Staff
License: See Original Project
This article explains how to build a machine-learning model for predicting precipitation. It discusses obtaining training data, exploring the samples, finding possible correlations in the data, preprocessing the samples to facilitate their use in training, and finally, choosing and building a model.
Source: https://pixabay.com/illustrations/artificial-neural-network-ann-3501528/
Prerequisites
Before getting started, make sure to install Python 3 and JupyterLab, as well as the following packages:
pip install jupyterlab pandas numpy seaborn scikit-learn Flask
Further, the article assumes you’re familiar with commonly used ML terminology and have experience with Jupyter and interactive notebooks. The project uses this data set from Kaggle. As it contains weather data collected in Seattle, it might not fit your geographic location, which can lead to poor model performance when making predictions for other regions. However, you can apply the techniques in this article to other data to improve the prediction quality for your area.
Loading and Exploring the Data
Start by including the following modules in your Jupyter Notebook:
import pandas as pd
import seaborn as sns
import numpy as np
import pickle
from sklearn import metrics
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
With the Pandas library, loading data from an external CSV file is as simple as calling a single function:
df = pd.read_csv("~/documents/seattle-weather.csv")
You can then output the df object to list a few samples and investigate their structure:
This image shows an excerpt of the original Seattle weather data set.
The data contains six attributes and 1461 samples. The weather column holds the value to predict in the final model and is a nominal label; I also output all possible values it can contain, as shown below. We’ll later have to convert this nominal label to a numeric value, but for now, note that reducing the dimensionality by simplifying the date column and dropping the precipitation column speeds up training and classification.
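The article doesn’t include the exact call, but Pandas’ unique() method lists every value the weather column can take:
print(df['weather'].unique())
For this data set, the possible values are drizzle, rain, sun, snow, and fog.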
I then wanted to get a feeling for the minimum and maximum values of all attributes to determine whether any of the samples contained outliers that could interfere with the training process:
print("Min Temp: " + str(min(df['temp_min'])))
print("Max Temp: " + str(max(df['temp_max'])))
print("Min Precipitation: " + str(min(df['precipitation'])))
print("Max Precipitation: " + str(max(df['precipitation'])))
print("Min Wind: " + str(min(df['wind'])))
print("Max Wind: " + str(max(df['wind'])))
Running the snippet produces the following output:
Min Temp: -7.1
Max Temp: 35.6
Min Precipitation: 0.0
Max Precipitation: 55.9
Min Wind: 0.4
Max Wind: 9.5
It seems the spread is not excessive, and normalization or outlier removal is not required. Finally, I also wanted to check whether any missing values need imputation or removal.
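The check itself isn’t shown in the article; a common way to count missing values per column in Pandas is:
print(df.isna().sum())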
As the output shows, the data doesn’t contain any samples with missing values.
Data Preprocessing
Start by modifying the date to reduce the data’s dimensionality. I decided to keep the month, as it makes sense to include it in predictions due to a higher likelihood of rain during certain seasons. The following code converts all date strings to date objects and then extracts the month as a numeric value from all samples. The second line then renames the date column to month:
df['date'] = pd.to_datetime(df['date'], errors='coerce').dt.month
df = df.rename(columns={"date": "month"})
Next, the nominal weather data is transformed into a numeric label. As the model should perform binary classification, the label to predict must either be one (rain) or zero (no rain):
df = df.replace(regex="rain|drizzle|snow", value=1)
df = df.replace(regex="sun|fog", value=0)
Replacing the weather label exposes a problem. Some samples have a precipitation value of zero, but the output label is one. The mismatch indicates that one of the two values needs to be corrected. Likely, drizzle was not considered precipitation, or some data samples are simply inaccurate. As this problem only affects around 6.5% of the samples, I removed them from the data set. The remaining examples still suffice to perform adequate model training:
df = df.drop(df[(df['precipitation'] == 0) & (df['weather'] == 1)].index)
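As a sanity check, the share of affected samples can be computed before executing the drop; this snippet is a sketch and not part of the original article:
# Run before the drop above: rows labeled as rain despite zero precipitation
mismatch = (df['precipitation'] == 0) & (df['weather'] == 1)
print(round(mismatch.mean() * 100, 1), "% of samples affected")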
Finally, I renamed the weather column to rain and removed the precipitation column:
df = df.rename(columns={"weather": "rain"})
df = df.drop(['precipitation'], axis=1)
Data Visualization
Although visualizing the data is optional, it can help you understand the relationships between attributes and how they might affect the model’s decision-making process. First, I wanted to inspect whether the data reflects my theory that some months have more rain than others.
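The plotting code isn’t shown in the article; a seaborn scatter plot along these lines produces a similar figure, assuming the column names defined above:
sns.scatterplot(data=df, x='month', y='temp_min', hue='rain')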
Correlating the month and minimum temperature of a day seems to reveal a statistical relationship between the attributes and the target label.
The orange dots indicate rain, and the blue ones represent samples without rain. The plot shows that some months, in combination with certain minimum temperature values, exhibit a higher percentage of positive samples (rain) than negative ones. For example, a minimum temperature on a November day above 2.5 degrees Celsius indicates an increased likelihood of rain that day. We can observe a similar relationship between the minimum and maximum temperatures. The following plot shows an almost perfect linear separation of the data:
Plotting the minimum and maximum temperatures on a 2D plane shows that the label seems linearly separable.
Finally, the wind and maximum temperature of the day also seemingly possess a noteworthy relationship, and they form two observable clusters:
Increasing wind and decreasing maximum temperature increase the likelihood of rain on any given day in the data.
The almost perfect linear separation based on the temperatures indicates that a decision tree might be a good choice, as a temperature-based split would almost perfectly divide the samples into the two classes. Similarly, the clustering shown in the final plot indicates that distance-based methods, such as kNN, could also be a reasonable choice, especially since the data doesn’t contain any outliers or excessive differences in value scaling.
Model Training
Based on the above plots, I decided to use a decision tree as the classifier. Decision trees are easy to understand, and their decisions can be visualized by following the nodes along the decision path. Note that I shortened this section dramatically to keep this project concise; the downloadable notebook contains more details and comparisons of different classifiers and their performance, and a brief comparison sketch follows after the split below.
Start by separating the attributes from the label and splitting the complete data set into training and test subsets.
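The article’s excerpt doesn’t show how X and y are derived; given the preprocessing above, a plausible definition separates the rain label from the remaining attributes:
X = df.drop(['rain'], axis=1)  # attributes: month, temp_max, temp_min, wind
y = df['rain']                 # binary label: 1 = rain, 0 = no rain
With X and y in place, the split itself is a single call: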
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
The X_train and X_test sets contain only the attribute values without the label, and the training set holds 70% of all samples. The y_train and y_test sets contain only the label to predict. Since we perform supervised learning, y_train is used during training, while y_test is only used to measure the model’s performance when predicting the samples in X_test.
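As a brief taste of the classifier comparison in the notebook, a cross-validated baseline using the models imported above could look like this; the snippet is a sketch rather than the article’s code, and kNN would typically also benefit from feature scaling with the imported StandardScaler:
# Hypothetical baseline: mean cross-validated accuracy per classifier
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, clf in [("Decision tree", DecisionTreeClassifier()),
                  ("kNN", KNeighborsClassifier()),
                  ("Random forest", RandomForestClassifier())]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    print(name, round(scores.mean(), 3))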
Next, train a decision tree with default parameters using the following commands:
t = DecisionTreeClassifier()
t = t.fit(X_train.values, y_train.values)
Then, have the model predict all samples in X_test and compare the results to the correct labels stored in y_test:
y_pred_t = t.predict(X_test.values)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_t))
The initial tree reaches an accuracy of around 70%, which is reasonable, and the result doesn’t indicate any significant underfitting or overfitting issues. However, visualizing the tree reveals that it is overly complex and can likely be optimized:
This image shows the initial decision tree’s structure, which is overly complex.
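The visualization code isn’t part of the article’s excerpt; scikit-learn’s tree module (imported above) can render the structure, assuming matplotlib is installed:
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
tree.plot_tree(t, feature_names=list(X.columns), class_names=['no rain', 'rain'], filled=True)
plt.show()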
Optimizing the Initial Classifier
The initial decision tree’s predictions are accurate in around 70% of cases, which is good but not ideal. Aside from that, the overly complex structure also offers room for improvement. A technique called hyperparameter tuning can help achieve both goals: a search algorithm tweaks specific parameters of the tree model, retrains it using the same data, and compares how the parameter changes affect the prediction accuracy. I also replaced the holdout method with cross-validation to tackle overfitting.
You could quickly implement a naive exhaustive search that tries all possible parameter combinations. However, scikit-learn’s built-in search utilities restrict the search to a predefined grid of sensible values and handle cross-validation and scoring for you, which keeps the runtime manageable:
# Create a decision tree classifier
gst = DecisionTreeClassifier()
# Define the hyperparameters to search over
params = {'max_depth': [2, 4, 8, 12], 'min_samples_split': [2, 4, 8], 'min_samples_leaf': [1, 2, 4]}
# Use cross-validation to tune the hyperparameters
search = GridSearchCV(gst, params, cv=5, scoring='accuracy')
search.fit(X, y)
You define the parameters and the values the search should try and then pass them to the search object, which, in this case, performs a grid search with five-fold cross-validation and rates the performance of each candidate tree by its accuracy. Note that you must pass the entire data set and all labels when fitting the tree in this manner, as cross-validation performs the splitting internally.
Once the search concludes, you can retrieve the best parameters and the achieved accuracy from the search object:
# Print the best hyperparameters and score
print("Best hyperparameters:", search.best_params_)
print("Best score:", search.best_score_)
The search reveals that the best tree reaches an accuracy of 75%, and it found the optimal parameters to be max_depth=8, min_samples_leaf=4, and min_samples_split=8.
As before, I decided to keep this section concise and excluded hyperparameter tuning for the other models. However, you can find the code in the notebook linked below.
Building the Optimal Decision Tree
Based on the search result, the following lines build and train the optimal decision tree for the Seattle weather data:
t8 = DecisionTreeClassifier(max_depth=8, min_samples_leaf=4, min_samples_split=8)
t8 = t8.fit(X_train.values, y_train.values)
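The article doesn’t repeat the evaluation step, but checking the tuned tree against the holdout set works exactly as before:
y_pred_t8 = t8.predict(X_test.values)
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_t8))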
Saving the Machine Learning Model
The next part of this series will expose the trained ML model to users via an online API. The trained model must be exported from the Jupyter Notebook to be loaded into a standard Python application later. Pickle is a tool commonly used for exporting and importing trained ML models, and saving the model is as easy as calling a single function:
pickle.dump(t8, open('model.pkl', 'wb'))
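Loading the model back in a regular Python script is just as simple. A quick round-trip check might look like this; the sample values are hypothetical and follow the attribute order month, temp_max, temp_min, wind:
model = pickle.load(open('model.pkl', 'rb'))
print(model.predict([[11, 8.0, 2.5, 4.0]]))  # hypothetical November day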
Download the Jupyter Notebook
Use this link to view the notebook with more comprehensive explanations and experiments.
Conclusion
Building a machine learning model that can successfully make predictions is less complicated than it may initially seem. The process starts with gathering data for model training. Ideally, the data should reflect the environment and situation in which the finished model will be deployed. Using data from different contexts or geographic regions may lead to poor model performance once deployed.
The next step involves investigating the data and finding potential problems. Note that the data set used in this example is nearly perfect and only contains a few inconsistencies that were easy to remove. Real-world data is usually far more incomplete and noisy, and more thorough preprocessing is typically required.
Visualizing the data helps you to understand the relationships between the attributes, which, in turn, assists in selecting a fitting model. Further, visualizations can help identify outliers and patterns in the data. In this particular example, I decided to use a decision tree due to the low dimensionality of the data and the model’s simplicity, performance, and explainability. However, other algorithms, such as kNN or even neural networks, could’ve also been viable options.
The last step involves training, evaluating, and improving the model. Many approaches exist, but hyperparameter tuning with cross-validation is commonly used. Together, these two methods make it possible to find the optimal settings for an algorithm while tackling overfitting. However, searching for suitable parameters, especially exhaustively, can be computationally costly, particularly for algorithms with many parameters, such as neural networks.
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.