Feature Selection for Embedded Machine Learning
2022-11-21 | By ShawnHymel
License: Attribution
In machine learning, feature selection is the process of choosing which inputs we should use when creating and training a model. By figuring out which features are the most important to a model, we can drop the less important ones, which reduces computation time, memory usage, and model complexity.
In embedded machine learning, this might be even more important, as features often correspond to sensor input. By dropping sensors, we can also save PCB space, money, and power.
Check out this video if you would like to see feature selection in action on a real embedded machine learning example:
All of the code and plots shown in this tutorial can be found in these two Colab Notebooks:
Feature Selection
Feature selection is related to dimensionality reduction. In both cases, we are trying to reduce the number of values being fed into a machine learning model in order to reduce the computational complexity of that model while minimizing any accuracy loss that we might incur.
Dimensionality reduction often requires a transformation of the data (along various dimensions), which incurs some computational cost. On the other hand, with feature selection, we can simply drop unimportant features at no cost. The problem is figuring out which features are unimportant. There are a number of techniques to assist with this task.
Feature selection can be broken down into two broad categories: unsupervised and supervised. In unsupervised feature selection, we use various statistical methods to examine the data without training a model or looking at the ground truth values. In supervised feature selection, we need to also examine the ground truth labels (possibly with a model) to determine how effective the features are in predicting those labels.
This post provides a good overview of popular feature selection methods if you would like to learn more.
Unsupervised Example: Pearson Correlation Coefficient
One of the most popular unsupervised feature selection methods is to examine the Pearson correlation coefficient (PCC) between each pair of features.
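For two features X and Y, the PCC is simply the covariance of the pair divided by the product of their standard deviations:

r(X, Y) = cov(X, Y) / (σ_X · σ_Y)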
Let’s try this with a very common and simple dataset: the iris dataset. Each sample contains four features: sepal length, sepal width, petal length, and petal width. Additionally, each sample is associated with one class (the ground truth label) corresponding to the species: Iris Setosa, Iris Versicolour, and Iris Virginica.
It’s usually a good idea to randomly split off 10-20% of the samples and set them aside as our test set so that they do not influence our feature selection process. From there, we plot each pair of features on an X-Y scatter plot to get something like the figure below (note that the diagonal cells are the histograms of each feature, since the correlation of a set of samples with itself is simply 1).
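As a rough sketch of those two steps, assuming scikit-learn's built-in copy of the iris dataset, seaborn for the plotting, and a 20% test split (details that may differ from the original notebooks):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset into a DataFrame (features plus a 'target' column)
iris = load_iris(as_frame=True)
df = iris.frame

# Set aside 20% of the samples as a test set so they do not
# influence the feature selection process
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Plot each pair of features against each other; the diagonal cells
# are the histograms of each individual feature
sns.pairplot(train_df.drop(columns=['target']), diag_kind='hist')
```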
This is a great way to see the relationship between each set of features. For example, there seems to be a strong linear correlation between petal length and petal width. Often, this means we can drop one of those values because it is redundant. Such a plot can also show non-linear correlation (x², etc.).
We can use the Python package pandas to calculate the PCC between each set of features and then use seaborn to create a nice heat map for us.
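A minimal sketch of that step, reusing the train_df DataFrame from the snippet above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation coefficient between every pair of features
corr = train_df.drop(columns=['target']).corr(method='pearson')

# Show the correlation matrix as an annotated heat map
sns.heatmap(corr, annot=True, vmin=-1.0, vmax=1.0, cmap='coolwarm')
plt.show()
```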
Note that PCC only gives us linear correlation. Values closer to 1 (or -1) indicate that the two features are more linearly correlated. Values closer to 0 mean that there’s no linear correlation. PCC is not great at providing an estimate of non-linear correlation. As we saw in the scatter plots, petal length and petal width seem to have a high degree of linear correlation. There also seems to be a good amount of correlation between sepal length and petal width/length.
From this information, you might try dropping petal length and/or width. You would still want to try training a model with and without dropped features to see how much accuracy you lose!
Supervised Example: LASSO
Least Absolute Shrinkage and Selection Operator (LASSO) adds an L1 regularization term to the loss function when training a machine learning model (in our case, a neural network). For example, let’s say we have a deep neural network with 2 hidden layers (probably overly complicated for the iris dataset, but let’s go with it anyway). The inputs are the four features, and the outputs are the confidence scores of each class.
We start with our loss function (e.g. categorical cross entropy) and add the L1 regularization term. During training, our optimization function will attempt to minimize the loss function. Adding the regularization term means that the optimization process will attempt to make the weights as small as possible, too.
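Written out, the quantity the optimizer minimizes is the original loss plus the sum of the absolute values of the weights, scaled by a constant λ:

loss_total = (categorical cross entropy) + λ · Σ |w|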
Normally, regularization is added to help fight overfitting, which you can read more about here. In our case, we can increase the effect of L1 regularization by making the λ term larger (around 0.1). This has the effect of driving the unimportant weights to 0 during training.
For feature selection, we only care about the weights going from the features (inputs) to the first layer of nodes, so we only need to apply our strong L1 term to that first layer. From there, we train the network normally using our training set. Ideally, the neural network (NN) will demonstrate decent accuracy in predicting the desired outcome (iris species, in this case). If not, you will need to use a different model architecture. Supervised feature selection only works well if the model works well.
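Here is a rough sketch of that setup in TensorFlow/Keras. The hidden layer sizes, layer name, and optimizer are assumptions on my part rather than the exact code from the notebooks; the important detail is that the strong L1 penalty is applied only to the first layer:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),  # 4 input features
    # First layer: strong L1 penalty (lambda = 0.1) so that weights coming
    # from unimportant input features are driven toward 0 during training
    layers.Dense(16, activation='relu', name='feature_layer',
                 kernel_regularizer=regularizers.l1(0.1)),
    # Second hidden layer: no special regularization needed here
    layers.Dense(16, activation='relu'),
    # Output layer: one confidence score per iris species
    layers.Dense(3, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```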
Once training is complete, we look at the weights of the first layer of nodes. Weights with larger absolute values show that the network believes those features are more important than features that have weights close to 0. This is demonstrated in the figure below with bolder arrows for stronger weights and faded arrows for weights close to 0.
With a large L1 term (λ = 0.1) for the first layer, I trained the NN shown above and plotted the loss and accuracy over time as well as the receiver operating characteristic curve. As you can see, this classifier did a pretty good job (96% on the validation set) at discerning between the different iris species.
Often, inputs have multiple weights going to the first layer, as there are usually multiple nodes in that layer. As a result, it is up to you how to combine those weights when ranking the features for importance. In my example, I took the root mean square (RMS) value of all the weights associated with a feature. Then, I normalized all the RMS values to create a relative ranking system for the features. After training the NN shown, I ended up with the following output (see the code sketch after this output for one way to compute it).
Feature importance (highest to lowest)
Feature name : RMS value : Normalized RMS
petal width (cm) : 0.22282097 : 0.7078785
petal length (cm) : 0.0814912 : 0.25888887
sepal width (cm) : 0.007265259 : 0.023080954
sepal length (cm) : 0.0031954865 : 0.010151721
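For reference, a ranking like the one above could be produced roughly as follows, assuming the hypothetical Keras model and iris data from the earlier sketches:

```python
import numpy as np

# Kernel of the first Dense layer: shape (num_features, num_nodes)
weights = model.get_layer('feature_layer').get_weights()[0]

# Combine all the weights fanning out from each input feature into one
# RMS value, then normalize so the values sum to 1
rms = np.sqrt(np.mean(np.square(weights), axis=1))
norm_rms = rms / np.sum(rms)

# Print the features from most to least important
for i in np.argsort(norm_rms)[::-1]:
    print(f'{iris.feature_names[i]} : {rms[i]} : {norm_rms[i]}')
```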
From this, we can conclude that petal width was the most important feature in the NN, followed by petal length (despite the apparent correlation between the two!).
From there, I trained a similar neural network with only petal width and length as input features and a less aggressive L1 term (λ = 0.001). The accuracy remained mostly unchanged (96% on the validation set) despite fewer input features! That being said, you should take this with a healthy grain of salt, as this is an extremely simple dataset.
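Continuing the earlier sketch, that reduced network would only change the input width and the regularization strength (again, the layer sizes here are assumptions):

```python
# Same architecture, but only the two petal features as inputs and a much
# weaker L1 penalty (lambda = 0.001), since we are no longer relying on it
# for feature selection
reduced_model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    layers.Dense(16, activation='relu',
                 kernel_regularizer=regularizers.l1(0.001)),
    layers.Dense(16, activation='relu'),
    layers.Dense(3, activation='softmax'),
])
```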
Applying Feature Selection to Embedded Machine Learning
The two feature selection methods shown above (along with the myriad other selection methods) can have a great impact on embedded machine learning. Such techniques often require a good level of experimentation (as does most machine learning) to find a good balance of input features and model accuracy.
In an embedded setting, you must also keep in mind the cost of features beyond computational complexity. Adding a feature might mean adding another sensor, which increases cost, board space, and power consumption.
For the Perfect Toast Machine, I performed the two feature selection methods demonstrated in this article and found that I could drop the ammonia and SGP30 (equivalent CO2 and one set of VOCs) sensors to achieve similar results. I highly recommend checking out the video above to see the performance with 2 fewer sensor boards. The code for the Perfect Toast Machine and feature selection process can be found here.
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.