How to Use Embedded Machine Learning to Do Speech Recognition on Arduino
2020-11-23 | By ShawnHymel
License: Attribution Arduino
Speech recognition is a powerful machine learning (ML) tool that allows humans to interact with computers using voice. You might be familiar with Amazon’s Alexa service that allows you to ask questions to a number of devices (like the Echo smart speaker) or issue commands. The ability to process these commands relies on powerful speech recognition software.
Note that the Echo does not do all of the processing itself: it waits to hear the pre-programmed keyword (or wake word), “Alexa.” Any commands given after that keyword are streamed to Amazon’s servers that perform natural language processing (NLP) to figure out what you’re trying to ask.
With that in mind, there are two forms of machine learning going on in such a smart speaker. The first is embedded machine learning, where inference is being performed locally on the device itself (in a microcontroller). The second is more complex machine learning, NLP, that requires the assistance of powerful computers across the Internet.
In this tutorial, I’m going to show you how to create your own keyword spotting system (with custom keywords!) on an Arduino. Note that we will use Edge Impulse to train our neural network model and generate a library for us to use in the Arduino.
See here if you’d like to view this tutorial in video format:
Required Hardware
All you need for this tutorial is an Arduino Nano 33 BLE Sense (and a USB micro cable).
Data Collection
To begin, you will need to download the Google Speech Commands dataset. Unzip (and untar) this collection of audio files somewhere on your computer (I recommend 7-Zip if you are on Windows). Navigate into the extracted directory, cut the _background_noise_ folder, and paste it somewhere outside the speech commands directory. We will need this directory to be separate from the rest of the keyword samples.
From here, you can choose one or two of the keywords found in the Google Speech Commands directory as your target keywords. If you would like to create your own custom keyword(s), keep reading.
Use any audio recording device (your computer, phone, etc.) to record your custom keyword. Try to aim for at least 50 samples with different inflections, pitch, etc. Ideally, you would want to get hundreds or thousands of samples from different people with different voices, genders, accents, etc. to create the most robust model. If you don't, you might find that your keyword spotter only responds to your voice (or similar voices).
Use a program like Audacity to edit the captured audio. At the bottom-left of the program, change the project rate to 16 kHz to match the sample rate of our target device (the microphone on the Arduino Nano 33 BLE Sense samples at 16 kHz). You’ll also want to select Tracks > Resample and change the sampling rate of the audio to 16 kHz.
Next, select a 1 second snippet of audio around the utterance of the keyword. It does not have to be exactly 1 second, but note that the curation script in the next part will truncate anything over 1 second or pad 0s to any audio sample less than 1 second.
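If you are curious what that rule looks like, here is a minimal conceptual sketch of truncating or zero-padding a clip to exactly 1 second (written in C++ with illustrative names; the actual curation script is Python):
// Conceptual sketch: clips longer than the target length are truncated and
// shorter clips are zero-padded. At 16 kHz, 1 second is 16,000 samples.
// The function name and types are illustrative, not from the curation script.
#include <cstddef>
#include <vector>

std::vector<float> fit_to_length(const std::vector<float> &clip, size_t target_len = 16000) {
    std::vector<float> out(target_len, 0.0f);  // zeros act as the padding
    size_t copy_len = (clip.size() < target_len) ? clip.size() : target_len;
    for (size_t i = 0; i < copy_len; i++) {
        out[i] = clip[i];  // anything past target_len is simply dropped
    }
    return out;
}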
Click File > Export > Export Selected Audio. Save the sample as a 32-bit floating point WAV file. Repeat this process for all utterances (aiming to get at least 50 samples). I recommend varying where the utterance starts in each sample to help make the model more robust; just try to keep the whole utterance within the 1 second selection.
You’ll want to save the custom keywords in a similar directory structure to the Google Speech Commands dataset. For example, if I capture samples for “Digi-Key” and “hadouken,” I’ll want the following directory structure where all of my samples are kept:
datasets
|- _background_noise_
|- custom_keywords
|--- digi-key
|--- hadouken
|- data_speech_commands_v0.02
|--- backward
|--- bed
|--- ...
Here, my custom keyword samples go into the digi-key and hadouken directories. Note that the file naming scheme does not matter, as the curation script will read in all files found in a given directory, shuffle them, mix them with background noise, and give them new names.
Data Augmentation and Curation
We want to mix in some background noise (to help make the model recognize the keyword in a variety of environments) and curate the data so that we end up with about 1500 samples in each of our categories (noise, unknown, keyword 1, keyword 2, and so on).
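To get a feel for what that mixing step does to each output clip, here is a rough sketch of the math (shown in C++ for illustration; the real script is Python, and the function and parameter names here are mine). The two gains correspond to the -w and -g arguments described later in this section.
// Rough illustration of producing one curated clip: scale the spoken word by
// the word gain and a random background-noise snippet by the noise gain, then
// sum them sample by sample.
#include <cstddef>

void mix_clip(const float *word, const float *noise, float *out,
              size_t num_samples, float word_gain, float noise_gain) {
    for (size_t i = 0; i < num_samples; i++) {
        float mixed = word_gain * word[i] + noise_gain * noise[i];
        // Clamp to [-1.0, 1.0] before the clip is written out as 16-bit PCM
        if (mixed > 1.0f)  mixed = 1.0f;
        if (mixed < -1.0f) mixed = -1.0f;
        out[i] = mixed;
    }
}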
I have put together a script to help with this curation process. Head to https://github.com/ShawnHymel/ei-keyword-spotting and download the repository as a .ZIP file. Unzip it somewhere on your computer. Note that if you move the dataset-curation.py file somewhere else, you will need to move the utils.py file with it, as it contains some functions needed by the curation script.
Note that if you do not wish to run the dataset curation script locally on your computer, I’ve put together a Jupyter Notebook file that you can run in Google Colab. Simply click this link to open the script in Colab. Follow the instructions and execute each cell to curate your data and send it to Edge Impulse.
If you are on Windows, I highly recommend installing Anaconda for this next part. It will help you maintain various versions of different libraries, and it makes installing librosa easier.
Make sure you have Python 3 installed on your computer. I used Python 3.7 for this tutorial. Open a terminal (or Anaconda prompt), and install the following libraries:
python -m pip install librosa numpy soundfile
Navigate to the directory containing dataset-curation.py and execute the script with arguments like the following (adjusting the paths to match where your datasets are stored):
python dataset-curation.py \
    -t "digi-key, go" \
    -n 1500 \
    -w 1.0 \
    -g 0.1 \
    -s 1.0 \
    -r 16000 \
    -e PCM_16 \
    -b "../../Python/datasets/background_noise" \
    -o "../../Python/datasets/keywords_curated" \
    "../../Python/datasets/data_speech_commands_v0.02" \
    "../../Python/datasets/custom_keywords"
The arguments are as follows:
- -t: Comma-separated list of target keywords
- -n: Number of mixed samples to appear in each category in the output directory (default: 1500)
- -w: Relative volume to multiply each spoken word by (default: 1.0)
- -g: Relative volume to multiply each background noise by (default: 0.1)
- -s: Time (in seconds) of each output clip (default: 1.0)
- -r: Sample rate (in Hz) of each output clip (default: 16000)
- -e: Bit depth of each output sample (default: PCM_16)
- -b: Directory where background noise samples are stored
- -o: Output directory where the mixed audio samples are saved
The last, unlabeled set of arguments is a list of input directories where your raw samples are stored (such as the Google Speech Commands dataset and your custom keywords).
When that’s done, you should have an output folder with a structure that looks like the following (assuming you selected “digi-key” and “go” as your target keywords):
keywords_curated
|- _noise
|- _unknown
|- digi-key
|- go
Feel free to listen to some of the audio files found in these folders. They should have been mixed with random snippets of noise from the _background_noise_ collection.
Model Training with Edge Impulse
Head to edgeimpulse.com and sign up for an account (or log in if you already have one). Create a new project, and give it a name (I’ll call mine “speech-recognition”).
On the left-side pane, go to Data Acquisition. Click the Upload Existing Data button.
You should be presented with an Upload Data screen that lets you select data to send to the Edge Impulse servers. Click Choose Files, highlight all of the WAV files from a single category, and click Open. If you followed the steps above exactly, you should have 1500 files in each category.
Let Edge Impulse automatically split the data between training and test sets, and also let it infer the label from the filename (which is the string before the first ‘.’ in the filename). Click Begin Upload and wait while your files are sent to Edge Impulse.
Repeat this process for all of the other files in the curated dataset. Click on Data Acquisition again to see all of your files on the server. You can hover over the labels to make sure that your files were correctly labeled.
Next, we need to create a pipeline for these samples (what Edge Impulse refers to as an “impulse”). Click on Create Impulse on the left side.
Click Add a processing block and add an Audio (MFCC) block. This will use a sliding window across each audio sample to produce the Mel Frequency Cepstral Coefficients (MFCCs) for that audio sample. See this article to learn more about MFCCs.
Next, click Add a learning block and add a Neural Network (Keras) block. This is a convolutional neural network (CNN) that is used to classify an input of MFCCs. Interestingly enough, the MFCCs look much like an image, so we can use a common image classifier, the CNN, to help identify keywords in each sample. See this article to learn more about CNNs.
Leave all of the settings at their defaults. This seems to work well enough to create a keyword spotting system. Click Save Impulse.
In the left pane, click on MFCC. Click on the Generate Features tab. Click Generate Features. Edge Impulse will compute the MFCCs for each training and test sample.
When that’s done, go to the NN Classifier section. Feel free to play around with the parameters, and you can even define your own neural network! If you don’t want to use the GUI, you can click on the vertical ellipsis menu and select Switch to Keras (expert) mode to create your own neural network in Keras.
However, I recommend leaving all of the parameters at their defaults for this first run (you can always adjust the parameters and try training again). Click Start Training, and wait while the neural network training process runs. When it’s done, you should see an output showing the accuracy and loss of the training and validation sets. The loss should be lower than when training first started (and the accuracy higher), and both should have converged around some values. I recommend aiming for a validation accuracy of at least 85%.
Scroll down to see a confusion matrix of the training set. Above 90% looks really good! Hover your mouse over any one of the values to see what the predicted vs. actual label is for that entry.
Go to Model Testing and select the checkbox above all of the samples (this will select all of the samples). Click Classify Selected. When that process is finished, you should get the results of how well the test set (samples not used in training) did compared to the model that we just trained.
You might see a difference between the accuracy of the training/validation set and the test set. In my case, “go” and “_unknown” performed much worse on the test set than on the training set.
This is likely due to a problem with the model overfitting (where the model performs better on the training set than it does on unseen data) during training. You can reduce overfitting through a variety of techniques, such as regularization terms, dropout layers, early stopping, and changing the model’s architecture. See here for some tips on how you can combat overfitting. But for now, let’s call this model good enough, considering it’s able to spot the key phrase we really care about: “Digi-Key.”
Head to Deployment, and select the Arduino Library option. Note that you can also download a generic C++ library that will work in almost any build system (so long as that build system can compile C++ code).
Scroll down and click Build. This will auto-generate an Arduino library containing your model and necessary functions to perform inference. Note that you can also click on Analyze Optimizations if you’d like to get an idea of how well your model will perform (the confusion matrix is based on the test set) and how quickly inference will run on an 80 MHz ARM processor.
Deploy to Arduino
Plug in your Arduino Nano 33 BLE Sense and open a new Arduino sketch. Go to Sketch > Include Library > Add .ZIP Library. Select the .zip file you just downloaded from Edge Impulse and click Open.
This should automatically install the auto-generated library we downloaded. Note that you may need to restart the Arduino IDE at this point. Go to File > Examples and locate the library you just installed (this is named “speech-recognition Inferencing” for me). Select the nano_ble33_sense_microphone_continuous example.
Feel free to look through this example sketch to see how the Edge Impulse library captures audio and performs inference. Note that inference is performed once every 333 ms: the 1-second model window is split into slices (set by the EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW macro, defined as 3 in this example sketch), so a new slice is captured and classified roughly every 1000 / 3 ≈ 333 ms.
The ei_printf() function is needed to assist with debugging and is defined elsewhere in the library. It relies on the Serial.print() function in Arduino to work, so you will get debugging information on the serial terminal.
The microphone_inference_record() function blocks until the audio buffer fills up before continuing. As a result, I recommend putting any code you want to run between inferences before this function call. However, note that if your code takes too long, the audio buffer can miss samples, resulting in a "buffer overrun" error.
The run_classifier_continuous() function uses the audio buffer (via the signal variable) to extract features (MFCCs) and perform inference with the trained model. The results are stored in the result struct.
This struct contains a number of variables that give us information about the output of the neural network. We can read the labels, but note that labels and their associated output values are in a specific order.
To figure out which index corresponds to each label, you’ll need to track it down in the library. Click Sketch > Show Sketch Folder. Go up one directory, and into libraries/<name of Edge Impulse library> (which is ei-speech-recognition-arduino-1.0.1 for me). Then, open the src/model-parameters/model_metadata.h file. Scroll down, and you should see a string array named ei_classifier_inferencing_categories. This array shows you the exact order of the labels. If we’re trying to identify the Digi-Key label, we need to use index 2.
Alternatively, you could watch the output in the Serial Terminal to determine the order of the labels or use strcmp() to compare your target label to the contents of result.classification[ix].label.
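For example, here is a small helper (my own addition, not part of the generated library) that looks up a label’s index at runtime with strcmp(), so you don’t have to hard-code it:
// Returns the index of the given label in the inference results, or -1 if the
// label is not found. The target string must match the label exactly as it
// appears in ei_classifier_inferencing_categories (e.g. "digi-key").
int find_label_index(const ei_impulse_result_t *result, const char *target) {
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    if (strcmp(result->classification[ix].label, target) == 0) {
      return (int)ix;
    }
  }
  return -1;
}
You could then use result.classification[find_label_index(&result, "digi-key")].value (after checking that the returned index is not -1) instead of the hard-coded index 2 used below.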
Back in the Arduino sketch, let’s enable the onboard LED and flash it whenever the Digi-Key value (more or less the probability that the neural network thinks it heard that keyword) is above a threshold. I’ve copied in some of the code below, including setup() and loop(). The rest of the sketch should remain unchanged.
// OTHER CODE...
static const int led_pin = LED_BUILTIN;
/**
* @brief Arduino setup function
*/
void setup()
{
pinMode(led_pin, OUTPUT);
// put your setup code here, to run once:
Serial.begin(115200);
Serial.println("Edge Impulse Inferencing Demo");
// summary of inferencing settings (from model_metadata.h)
ei_printf("Inferencing settings:\n");
ei_printf("\tInterval: %.2f ms.\n", (float)EI_CLASSIFIER_INTERVAL_MS);
ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) /
sizeof(ei_classifier_inferencing_categories[0]));
run_classifier_init();
if (microphone_inference_start(EI_CLASSIFIER_SLICE_SIZE) == false) {
ei_printf("ERR: Failed to setup audio sampling\r\n");
return;
}
}
/**
* @brief Arduino main function. Runs the inferencing loop.
*/
void loop()
{
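// Any of your own per-loop code should go here, before the call below blocks
// while waiting for the next slice of audio to be captured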
bool m = microphone_inference_record();
if (!m) {
ei_printf("ERR: Failed to record audio...\n");
return;
}
signal_t signal;
signal.total_length = EI_CLASSIFIER_SLICE_SIZE;
signal.get_data = &microphone_audio_signal_get_data;
ei_impulse_result_t result = {0};
EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);
if (r != EI_IMPULSE_OK) {
ei_printf("ERR: Failed to run classifier (%d)\n", r);
return;
}
// Turn on LED if "Digi-Key" value is above a threshold
if (result.classification[2].value > 0.7) {
digitalWrite(led_pin, HIGH);
} else {
digitalWrite(led_pin, LOW);
}
if (++print_results >= (EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW)) {
// print the predictions
ei_printf("Predictions (DSP: %d ms., Classification: %d ms., Anomaly: %d ms.): \n",
result.timing.dsp, result.timing.classification, result.timing.anomaly);
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
ei_printf(" %s: %.5f\n", result.classification[ix].label,
result.classification[ix].value);
}
#if EI_CLASSIFIER_HAS_ANOMALY == 1
ei_printf(" anomaly score: %.3f\n", result.anomaly);
#endif
print_results = 0;
}
}
// REST OF CODE...
Note that I used an index of 2 to refer to the target label.
Upload the sketch, and open the Serial Terminal. You should see some timing results along with the output predictions.
Notice that each inference takes about 261 ms for DSP plus about 13 ms for the actual inference (classification). That means every 333 ms, the processor is fully loaded performing these calculations for 274 ms (or ~82% of the time)! That does not leave you much time to do other things in the processor, so keep that in mind when working with machine learning on such a resource constrained device.
You should also see the output of the classifier, which should be your labels with some numbers. Each number essentially gives us the probability for each label. Try saying one of the keywords close to the microphone of the Arduino and watch how those numbers change. In the screenshot above, I said “Digi-Key,” and the digi-key label shot to 0.98, letting me know that it thinks it heard that keyword.
Try saying the other keyword (e.g. “go”). How did it respond? In my experience, short, monosyllabic words perform much worse than polysyllabic words and phrases. Additionally, we know that the model overfit the training data in the previous section, so I would not expect a very accurate model for this keyword.
In my example, if I say “Digi-Key” near the board, the LED should also briefly light up.
Note that the user LED that we’re controlling is the yellow LED (green is power). It’s fairly dim on the Nano 33 BLE Sense, so you might have to look closely.
Resources and Going Further
I hope this has helped you get started with your own speech recognition and keyword spotting system on Arduino! While the process is quite involved, we essentially made a “blinky” program based around listening for custom keywords.
Because most of the processor time is spent listening (recording audio) and performing inference, you are not left with many resources in the microcontroller. Keep that in mind when you are designing your final project or product. You will need to carefully add code so that you don't overrun your audio buffer, or you can dedicate the microcontroller as a co-processor that just listens for keywords. The co-processor could then notify some other electronics by toggling a GPIO line or by sending an I2C command.
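As a rough sketch of that co-processor idea (the pin number, I2C address, and message byte below are arbitrary choices of mine, not anything defined by the Edge Impulse library), you could add a function like the following and call it from the same place where the LED is toggled:
// Notify a hypothetical host device that the keyword was heard by pulsing a
// GPIO line and sending a single byte over I2C. Call pinMode(NOTIFY_PIN, OUTPUT)
// and Wire.begin() in setup() first.
#include <Wire.h>

const int NOTIFY_PIN = 2;            // any free digital pin wired to the host
const uint8_t HOST_I2C_ADDR = 0x08;  // hypothetical I2C address of the host

void notify_host() {
  // Short pulse that the host can catch with a pin-change interrupt
  digitalWrite(NOTIFY_PIN, HIGH);
  delay(10);
  digitalWrite(NOTIFY_PIN, LOW);

  // One-byte "keyword detected" message over I2C
  Wire.beginTransmission(HOST_I2C_ADDR);
  Wire.write((uint8_t)0x01);
  Wire.endTransmission();
}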
Here are some resources to help you in your embedded machine learning journey:
- Keyword Spotting data curation repository
- Edge Impulse documentation
- Research paper on using CNNs for keyword spotting (by Google)
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.