
TensorFlow Lite Tutorial Part 1: Wake Word Feature Extraction

2020-02-03 | By ShawnHymel

License: Attribution Raspberry Pi

Machine Learning (ML) is gaining traction in the tech community, and TensorFlow seems to be the most popular framework at the moment. Using embedded devices for ML is still somewhat new, and a subset of TensorFlow, called “TensorFlow Lite,” has been released to allow us to run inference with trained models on smaller, lower-power devices like single-board computers and microcontrollers.

Over the next few tutorials, we’ll guide you through extracting features from audio files, training a convolutional neural network (CNN) model to detect one spoken word out of many, and deploying that model to a Raspberry Pi to detect that word in real time.

In this tutorial, we will introduce the concept of Mel Frequency Cepstral Coefficients (MFCC) and how to compute them using Python libraries. We will create a set of training data consisting of MFCC samples that will then be fed to the CNN in the next tutorial.

The information contained in this tutorial can also be viewed in video format:

Prerequisites

Before starting this tutorial, you will need to accomplish a few tasks:

  • If you need an introduction to Edge Artificial Intelligence (Edge AI), we recommend reviewing this tutorial.
  • You will need to have TensorFlow, Keras, Python, and Jupyter Notebook installed on your desktop or laptop. This tutorial will walk you through the steps to do that.

Alternatively, you can use Google Colab to run Jupyter Notebook on Google’s servers. Note, however, that loading files (such as training samples) will require you to upload them to your Google Drive and write different code to import them into your program. This guide offers some tips on how to load files in Colab.

Overview

The overall process of how to train and deploy a model might be confusing to newcomers. Hopefully, we can help provide a broad overview of how that works.

Before starting, you’ll want to identify the problem you wish to solve and ask yourself: Is machine learning even necessary to tackle the problem at hand? If yes, then there are generally four steps that you should consider taking:

Steps to train and deploy a machine learning model

First, you will need to collect a bunch of data--more than you think you’ll need. If you plan to train a deep neural network, thousands or even hundreds of thousands of samples are often necessary.

Second, you will need to spend some time extracting features from those samples. Often, data in its raw form is not usable by a machine learning model, so you will need to transform the data into something that is usable. This can include things like performing a Fourier transform to examine the spectral and power components of a signal, encoding words into numbers, or breaking an image apart into separate color channels.

With the features, we can then train our model. This involves running one or more algorithms, such as backpropagation, to automatically update the model’s parameters (weights and bias terms) so that its predictions more closely match the expected outcomes. You might need to try different models or adjust various hyperparameters (values that are set prior to training, such as the types of features used and the size and shape of the model) to achieve the desired accuracy.
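
If you have never seen what this step looks like in code, here is a purely illustrative Keras sketch of training. The layer sizes, learning rate, number of epochs, and random placeholder data below are not the values we will actually use; we build the real CNN in the next tutorial.

Copy Code
# Purely illustrative training sketch: random placeholder data, tiny model
import numpy as np
import tensorflow as tf

x = np.random.rand(100, 16, 16, 1)       # 100 fake 16x16 "MFCC images"
y = np.random.randint(0, 2, size=100)    # fake binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, (3, 3), activation='relu', input_shape=(16, 16, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Hyperparameters (optimizer, learning rate, epochs) are set before training
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Backpropagation updates the weights; validation data tracks progress
model.fit(x, y, epochs=3, validation_split=0.1)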

Once you have a fully trained model and are happy with the way it performs on unseen data, you can deploy it to a production environment. In this tutorial series, we will convert our model file (.h5) to a TensorFlow Lite model file (.tflite) and copy it to a Raspberry Pi. We will then use the TensorFlow Lite inference engine to make predictions with our model in real time.
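
As a quick preview of that last step, converting a saved Keras model to a TensorFlow Lite model takes only a few lines in TensorFlow 2.x. This is just a sketch for now (the filename wake_word_model.h5 is a placeholder); we will walk through the actual conversion later in the series.

Copy Code
# Sketch of the Keras (.h5) to TensorFlow Lite (.tflite) conversion
import tensorflow as tf

model = tf.keras.models.load_model('wake_word_model.h5')   # placeholder name
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open('wake_word_model.tflite', 'wb') as f:
    f.write(tflite_model)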

Collect Data

Lucky for us, we don’t need to collect data manually. Google has already done this for us in their Google Speech Commands dataset, which consists of thousands of 1 second audio clips of people saying different words. See this page to learn more about it and even contribute your own voice!

To start, download and unzip the Google Speech Commands dataset to your computer.
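
If you prefer to script the download, something like the following should work. The URL below points to the v0.02 release of the dataset; double-check it against the dataset page before running, and note that the archive is a multi-gigabyte download.

Copy Code
# Download and extract the Google Speech Commands dataset (v0.02)
import tarfile
import urllib.request

url = 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz'
archive = 'speech_commands_v0.02.tar.gz'
dataset_path = 'speech_commands_dataset'

urllib.request.urlretrieve(url, archive)       # large download
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall(path=dataset_path)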

If you wish to add your own wake word to the set, you will need to create a folder in the dataset directory and collect a bunch of examples of people saying that word. You can probably get away with recording just yourself saying the word, but there’s a good chance the model will then learn to recognize your voice rather than the keyword itself.

Mel Frequency Cepstral Coefficients

If we view a sound clip as a function of value over time, we get the familiar waveform shape shown below (a 1-second audio clip of someone saying “stop”). While some ML models can take this time sequence of values and use it to identify wake words, they do not seem to work as well as models trained on features that represent sound the way humans hear it.

Extracting MFCCs from audio clip

As humans, our ears and brain function by listening to frequencies over a period of time (think about how music is made up of notes, which are just sound frequencies played for a short duration). To accomplish that with a computer, we take a moving window (a time slice) of the given audio file and compute the Fast Fourier Transform (FFT) of that time slice. That gives us an indication of power at each frequency component for that time slice.

We then filter that power graph using a Mel-spaced filterbank. These are filters that are spaced at linear (under 1 kHz) and logarithmic (over 1 kHz) intervals across the frequency spectrum, which emulates how human ears perceive sound.

The energy under each of these filters is added up to create a vector of numbers. For most applications, the logarithm of these numbers is computed and the discrete cosine transform (DCT) is then performed on the vector. The output of these operations gives us the Mel Frequency Cepstral Coefficient (MFCC) terms. The lower elements (e.g. elements 0 and 1) provide information about the overall shape of the spectrum in that time slice, whereas the higher elements (above 2) give information about the finer details (noise, etc.).

In most machine learning applications for automatic speech recognition (ASR), only elements 1-13 are used. To make things simpler for our neural network in the next tutorial, we’ll keep elements 0-15. Feel free to play with the values used to see if you can make the neural network faster or more accurate.

Extracting MFCCs from audio clip

We then slide the window forward by a set amount and compute the next set of MFCCs. We keep sliding the window until we’ve computed all the MFCCs for that particular audio sample. For our purposes, we end up with a 16-by-16 array of MFCC values that describes the waveform’s spectral shape over time.
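
To make the process more concrete, here is a rough sketch of the math applied to a single time slice using NumPy, SciPy, and librosa’s Mel filterbank. The 2,048-sample window, 26 Mel filters, and 8 kHz sample rate are just illustrative choices; in the Notebook, the heavy lifting is done by a library instead (more on that below).

Copy Code
# One time slice: FFT -> Mel filterbank -> log -> DCT -> MFCCs
import numpy as np
import librosa
from scipy.fftpack import dct

fs = 8000                                    # sample rate (Hz)
t = np.arange(2048) / fs
window = np.sin(2 * np.pi * 440 * t)         # stand-in time slice (440 Hz tone)

# 1. FFT of the (Hanning-windowed) slice gives the power at each frequency
power = np.abs(np.fft.rfft(window * np.hanning(len(window)))) ** 2

# 2. Sum the energy under each Mel-spaced filter
mel_fb = librosa.filters.mel(sr=fs, n_fft=2048, n_mels=26)
mel_energies = mel_fb @ power

# 3. Take the log, then the DCT, and keep the first 16 coefficients
mfcc_slice = dct(np.log(mel_energies + 1e-10), type=2, norm='ortho')[:16]
print(mfcc_slice)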

Another way to view this MFCC array is as a grayscale image; here, we have matplotlib apply false color to make it easier to visualize. If we compare the MFCC image of the spoken word “stop” to that of “zero,” we can see differences in the spectral components. While it might be difficult for a human to spot these differences, we can easily train a neural network to identify them.

This is a very basic overview of MFCCs and why they’re useful. If you’d like to learn more, see this article.

Run Python Script to Extract Features

To extract these features, we’ll be using the python_speech_features library, which you can install via pip.
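
To give you an idea of how the library is used, here is a sketch of a helper along the lines of the calc_mfcc() function you will find in the Notebook. The window length (0.256 s), step (0.05 s), and 8 kHz resampling shown here are assumptions that happen to produce 16 windows from a 1-second clip; refer to the Notebook for the exact settings.

Copy Code
# Sketch: 1-second clip -> 16x16 MFCC array with python_speech_features
import librosa
import numpy as np
import python_speech_features

def calc_mfcc_sketch(path, sample_rate=8000, num_mfcc=16):
    # Load the clip, resampled to the target rate
    signal, fs = librosa.load(path, sr=sample_rate)

    # 16 coefficients per window, 16 windows across a full 1-second clip
    mfccs = python_speech_features.mfcc(signal,
                                        samplerate=fs,
                                        winlen=0.256,
                                        winstep=0.050,
                                        numcep=num_mfcc,
                                        nfilt=26,
                                        nfft=2048,
                                        winfunc=np.hanning)
    return mfccs.transpose()   # shape: (num_mfcc, number of windows)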

Navigate to https://github.com/ShawnHymel/tflite-speech-recognition and download the Jupyter Notebook and Python files. Start a Jupyter Notebook session on your computer and open 01-speech-commands-mfcc-extraction.

If you do not have one of the packages listed in the first cell (or throughout the Notebook), you can install it by running the following in its own cell:

Copy Code
!pip install <name of package>

Change the dataset_path variable to point to the Google Speech Commands dataset directory on your computer.

I recommend running the Notebook one cell at a time and reading the comments and output to get a feel for what is happening in the script. 

Splitting a data set for machine learning

Before computing MFCCs, we first combine all of the filenames into one long list. We then randomly shuffle that list and set aside 10% for cross validation, 10% for testing, and the remaining 80% for training. You will often see 10-20% of the data set aside for each of the validation and test sets in machine learning projects.

Note that part of this process includes associating ground truth values to the samples. We need to assign a number value to each spoken word, so we go in alphabetical order of the directories. “Backward” is assigned 0, “bed” is assigned 1, “bird” is assigned 2, and so on. These numbers will be used by the model training algorithm to know how far off its prediction is from the actual answer, and this loss value will be used to update the weights and biases in the model.
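
Put together, the shuffling, splitting, and label assignment described above looks roughly like the sketch below. This is not the exact Notebook code, and dataset_path is a placeholder for wherever you unzipped the dataset.

Copy Code
# Sketch: build (filename, label) pairs, shuffle, and split 80/10/10
import os
import random

dataset_path = 'speech_commands_dataset'   # placeholder path

# Alphabetical directory order -> numeric label (backward=0, bed=1, ...)
# The _background_noise_ folder is skipped; it holds noise clips, not words
target_list = sorted(d for d in os.listdir(dataset_path)
                     if os.path.isdir(os.path.join(dataset_path, d))
                     and not d.startswith('_'))

samples = [(os.path.join(target, f), label)
           for label, target in enumerate(target_list)
           for f in os.listdir(os.path.join(dataset_path, target))
           if f.endswith('.wav')]

# Shuffle, then set aside 10% for validation and 10% for testing
random.shuffle(samples)
val_size = test_size = int(0.1 * len(samples))
val_set = samples[:val_size]
test_set = samples[val_size:val_size + test_size]
train_set = samples[val_size + test_size:]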

The training sample set will be used to train the model. Cross validation occurs on each pass of training to get an accuracy or loss score. This helps us know if the model is making progress on (mostly) unseen data--will it work on data that it did not train with? Finally, the test set should be put aside and not touched until the very end. We will use it to get a final accuracy and loss score for the model once it has been thoroughly trained.

When you reach the cell labeled # TEST: Test shorter MFCC, you can change the idx variable to point to a single sample in the training set. The cell will play the sample through your computer’s speakers and show you the MFCCs in number and image form.

Copy Code
# TEST: Test shorter MFCC
# !pip install playsound

from playsound import playsound

# Note: join, dataset_path, target_list, y_orig_train, filenames_train,
# calc_mfcc, and plt are all defined/imported in earlier Notebook cells.

# Pick a single sample from the training set
idx = 13

# Create path from given filename and target item
path = join(dataset_path, target_list[int(y_orig_train[idx])], 
            filenames_train[idx])

# Create MFCCs
mfccs = calc_mfcc(path)
print("MFCCs:", mfccs)

# Plot MFCCs as a false-color image
fig = plt.figure()
plt.imshow(mfccs, cmap='inferno', origin='lower')

# TEST: Play problem sounds
print(target_list[int(y_orig_train[idx])])
playsound(path)

Note that some samples in the dataset are incomplete: they stop short of 1 second or are corrupted. If we cannot read a file, or it does not produce exactly 16 sets of MFCCs, we simply drop the sample. There are myriad ways to deal with bad data, such as padding short samples with values that approximate silence or noise, but the simplest is often to just leave the bad samples out. The Speech Commands dataset seems to have about 10% bad samples, so leaving them out will work fine for the next training step.
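
Continuing with the hypothetical names from the earlier sketches (calc_mfcc_sketch, train_set, and dataset_path), dropping the bad samples can be as simple as this:

Copy Code
# Keep only samples that read cleanly and produce a full 16x16 MFCC array
import os

x_train, y_train = [], []
for filename, label in train_set:
    try:
        mfccs = calc_mfcc_sketch(os.path.join(dataset_path, filename))
    except Exception:
        continue                    # unreadable or corrupted file: skip it
    if mfccs.shape == (16, 16):     # drop clips shorter than 1 second
        x_train.append(mfccs)
        y_train.append(label)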

After computing the MFCCs for all of the good samples, we save them as tensors in the all_targets_mfcc_sets.npz file.

A tensor is a collection of numbers arranged in any number of dimensions: a vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and so on.

We can load this file in our next tutorial and use the features it contains to train our neural network.
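
Saving and reloading the features is a one-liner each with NumPy. The key names below are just examples (the Notebook saves separate training, validation, and test arrays under its own key names):

Copy Code
# Save the finished feature tensors, then reload them in the next tutorial
import numpy as np

np.savez('all_targets_mfcc_sets.npz',
         x_train=np.array(x_train),
         y_train=np.array(y_train))

data = np.load('all_targets_mfcc_sets.npz')
x_train, y_train = data['x_train'], data['y_train']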

Going Further

Even though we have not done any machine learning yet, this tutorial still has a lot of information to take in! Extracting features is an extremely important part of any ML project; in fact, many researchers claim to spend more time on feature extraction than on the actual machine learning.

Here are some resources to help you in your journey. In the next tutorial, we will tackle training a neural network with our features.

Mel Frequency Cepstral Coefficients:

Splitting data into training, validation, and test sets:

Recommended Reading

TensorFlow Lite Tutorial Part 2: Speech Recognition Model Training

TensorFlow Lite Tutorial Part 3: Speech Recognition on Raspberry Pi

Manufacturer Part Number: 3367
MINI USB MICROPHONE
Adafruit Industries LLC
Manufacturer Part Number: SC0130(J)
SBC 1.4GHZ 4 CORE 512MB RAM
Raspberry Pi
Manufacturer Part Number: SC0194(9)
RASPBERRY PI 4 B 4GB
Raspberry Pi