Add High-Performance Speech Keyword Spotting to IoT Designs: Part 2 – Using MCUs
Contributed By DigiKey's North American Editors
2018-10-17
Editor’s Note: Using an emerging class of efficient algorithms, any developer can now deploy sophisticated keyword spotting features on low-power, resource-constrained systems. Part one of this two-part series showed how to do it with FPGAs. Here in Part two we will show how to do it with MCUs.
Keyword spotting (KWS) technology has emerged as an increasingly important feature for wearables, IoT devices, and other smart products. Machine learning methods can provide exceptional KWS accuracy, but the power and performance limitations of these products have until recently limited use of machine learning KWS solutions to the largest enterprises or highly experienced machine learning experts.
However, developers increasingly need to implement voice activated KWS features in wearables and other IoT devices on more efficient KWS engines that are able to operate within the resource constraints of these devices. Depthwise separable convolutional neural network (DS-CNN) architectures modify conventional CNNs to provide the needed efficiency.
Using a hardware-optimized neural network library, developers can implement MCU-based DS-CNN inference engines that require minimal resources. This article describes the DS-CNN architecture and shows how developers can implement DS-CNN-based KWS on MCUs.
Why KWS?
Consumer acceptance of voice activated features on smartphones and home appliances using phrases such as "Alexa," "Hey Siri," or "Ok Google" has rapidly evolved to broad demand for voice services on nearly any product designed for user interaction. Underlying these services, accurate speech recognition relies upon a variety of artificial intelligence methods to identify spoken words and interpret words and phrases as commands appropriate to the application.
However, the resources required to quickly and accurately complete this entire voice command sequence start to exceed the capabilities of low-cost, line-powered consumer hubs, much less battery-operated personal electronics.
Voice command pipeline requirements
To deliver voice activation on these products, developers either split the voice command pipeline between the device and the cloud or limit voice commands to a few very simple words such as "on" and "off." On a resource-limited consumer product, developers implement KWS capabilities using neural network inference engines able to deliver the required high-accuracy, low-latency response to simple commands or to the activation phrases for Alexa, Siri, or Google (Figure 1).
Here, the design digitizes the input audio stream, extracts speech features, and passes those features along to a neural network for identification of the keyword.
Figure 1: Voice activation with KWS uses a processing pipeline that extracts frequency domain features from an audio input signal and classifies the extracted features to predict the probability that the input signal corresponds to one of the labels used to train the neural network. (Image source: Arm®)
By converting the amplitude modulated audio input stream to features in a frequency spectrogram, developers can take advantage of the proven ability of convolutional neural network (CNN) models to accurately classify the spoken word according to one of the labels used during neural network training.2 For more complex voice interfaces, the command processing pipeline extends beyond the device itself. After the KWS inference engine detects the activation keyword or phrase, the product passes the digitized audio stream to cloud-based services able to more effectively handle complex speech processing and command recognition operations.
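As a concrete illustration of this pipeline, the following C++ sketch shows the shape of a minimal KWS loop. The four platform functions are hypothetical placeholders standing in for a microphone driver, an audio front end, a neural network inference call, and the application's response to a detected keyword; they are not part of any library discussed in this article, and the buffer sizes are merely typical values.

// Hypothetical KWS loop sketch; the four platform functions are placeholders
// that a real design would supply (microphone driver, MFCC front end,
// inference engine, and application response).
#include <cstdint>
#include <cstddef>

void read_audio_frame(int16_t* samples, size_t count);      // placeholder: fill from ADC/DMA
void compute_features(const int16_t* samples, float* out);  // placeholder: audio front end
void classifier_run(const float* features, float* scores);  // placeholder: neural network
void handle_keyword(int class_index);                       // placeholder: application action

constexpr int kNumClasses = 12;             // 10 keywords plus "silence" and "unknown"
constexpr float kDetectionThreshold = 0.8f;

void kws_loop() {
    static int16_t audio[16000];            // e.g., 1 s of 16 kHz audio
    static float   features[49 * 10];       // e.g., 49 frames x 10 MFCC coefficients
    static float   scores[kNumClasses];

    for (;;) {
        read_audio_frame(audio, 16000);     // digitize the input audio stream
        compute_features(audio, features);  // extract frequency domain speech features
        classifier_run(features, scores);   // classify the features against the trained labels

        int best = 0;                       // pick the most probable class
        for (int i = 1; i < kNumClasses; ++i)
            if (scores[i] > scores[best]) best = i;

        if (scores[best] > kDetectionThreshold)
            handle_keyword(best);           // e.g., wake the device or hand audio off to the cloud
    }
}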
Still, the conflict between device resource availability and inference engine resource requirements has confounded developers' attempts to apply these methods to even smaller designs for wearables and IoT devices. Although the classic CNN is relatively well understood and straightforward to train, these models can still be resource intensive. As the accuracy of CNN models in recognizing images has increased dramatically, CNN size and complexity have also increased significantly.
The result is very accurate CNN models that require billions of compute-intensive general matrix multiply (GEMM) operations for training. Once trained, the corresponding inference models can occupy hundreds of megabytes of memory and require a very large number of GEMM operations for a single inference.
For battery-operated wearables and IoT devices, an effective KWS inference model must be able to run in limited memory with low processing requirements. In addition, because a KWS inference engine must operate in "always on" mode to perform its function, it must be able to operate with minimal power consumption.
This dichotomy between the potential of neural networks and the limited resources in the increasingly attractive arena of wearables and IoT devices has attracted significant attention from machine learning experts. The result has been development of techniques for optimizing the basic CNN model and the appearance of alternative neural network architectures able to bridge the gap between performance requirements and resource capabilities of small resource-constrained devices.
Small footprint models
Among the techniques for creating small footprint models, machine learning experts have applied optimization methods such as network pruning and parameter quantization to produce CNN variants able to deliver results nearly as accurate as full CNNs, but using a fraction of the resources. The success of these reduced-precision neural networks paved the way for binarized neural network (BNN) architectures that reduce model parameters from the 32-bit floating-point values (or the 16- and 8-bit values found in earlier reduced-precision CNNs) down to 1-bit values. As described in Part 1, the Lattice Semiconductor machine learning SensAI™ platform uses this highly efficient BNN architecture as the basis for a 1 milliwatt (mW) KWS solution running on its iCE40 UltraPlus FPGA-based mobile development platform, or MDP.
Along with reductive techniques such as network pruning and parameter quantization, there are other approaches to lowering resource requirements that modify the topology of the CNN architecture itself. Among these alternative architectures, the depthwise separable convolutional neural network offers a particularly effective approach for creating small, resource efficient models able to run on general purpose MCUs.
Building on earlier work, Google machine learning experts found a way to increase the efficiency of CNNs by focusing on the convolution layer itself. In a conventional CNN, each convolution layer filters input features and combines them into a new set of features in a single step (Figure 2, top).
Figure 2: Unlike a full convolution (top), a depthwise separable convolution first uses a DK x DK filter (middle) to separately filter each of the M input channels, and then uses a pointwise 1 x 1 convolution (bottom) to create N new features. (Image source: Google)
The new approach breaks filtering and feature generation into two separate stages, together called a depthwise separable convolution. The first stage performs a depthwise convolution which acts as a spatial filter on each channel of an input (Figure 2, middle). Because this first stage does not create new features (the core objective of a deep neural network architecture), the second stage performs a pointwise 1 x 1 convolution (Figure 2, bottom) that combines the outputs of the first stage to generate new features.
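To make the two stages concrete, the sketch below implements a depthwise convolution followed by a pointwise 1 x 1 convolution using plain floating-point loops (stride 1, no padding). It is a conceptual illustration of the data flow only, not the quantized, hardware-optimized implementation discussed later in this article.

// Conceptual depthwise separable convolution (float, stride 1, no padding).
// Stage 1 (depthwise): each of the M input channels gets its own K x K spatial filter.
// Stage 2 (pointwise): a 1 x 1 convolution combines the M filtered channels into N new features.
#include <vector>

void depthwise_separable_conv(
    const std::vector<float>& in, int M, int H, int W,   // input: M channels of H x W
    const std::vector<float>& dw, int K,                 // depthwise weights: M * K * K
    const std::vector<float>& pw, int N,                 // pointwise weights: N * M
    std::vector<float>& out)                             // output: N channels of (H-K+1) x (W-K+1)
{
    const int OH = H - K + 1, OW = W - K + 1;
    std::vector<float> mid(M * OH * OW, 0.0f);           // depthwise result, still M channels

    // Stage 1: depthwise convolution - spatial filtering, one filter per input channel
    for (int m = 0; m < M; ++m)
        for (int y = 0; y < OH; ++y)
            for (int x = 0; x < OW; ++x) {
                float acc = 0.0f;
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        acc += in[(m * H + y + ky) * W + (x + kx)] * dw[(m * K + ky) * K + kx];
                mid[(m * OH + y) * OW + x] = acc;
            }

    // Stage 2: pointwise 1 x 1 convolution - creates the N new features
    out.assign(N * OH * OW, 0.0f);
    for (int n = 0; n < N; ++n)
        for (int y = 0; y < OH; ++y)
            for (int x = 0; x < OW; ++x) {
                float acc = 0.0f;
                for (int m = 0; m < M; ++m)
                    acc += mid[(m * OH + y) * OW + x] * pw[n * M + m];
                out[(n * OH + y) * OW + x] = acc;
            }
}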
Used in Google's MobileNet models for mobile and embedded vision applications, this DS-CNN architecture reduces the number of parameters and associated operations, resulting in smaller models that require significantly fewer computations to achieve accurate results.3
Compared to full convolutions, the use of depthwise separable convolutions in MobileNet models reduces accuracy by only about 1% on the industry-standard ImageNet data set, while using less than 12% of the multiply-add operations and about 14% of the model parameters required by conventional ImageNet CNN models built from full convolutions.
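These savings follow directly from the layer dimensions. For a K x K kernel, M input channels, N output channels, and a DF x DF output map, a full convolution performs K·K·M·N·DF·DF multiply-adds, while the depthwise separable version performs K·K·M·DF·DF + M·N·DF·DF, a ratio of 1/N + 1/K². The short program below works the numbers for one illustrative layer; the dimensions are examples chosen for the calculation, not values taken from MobileNet.

// Worked example: multiply-add count for a full convolution vs. the equivalent
// depthwise separable convolution (illustrative layer dimensions).
#include <cstdio>

int main() {
    const long long K  = 3;    // kernel size (K x K)
    const long long M  = 64;   // input channels
    const long long N  = 128;  // output channels
    const long long DF = 56;   // output feature map size (DF x DF)

    const long long full_conv = K * K * M * N * DF * DF;                // single-stage convolution
    const long long ds_conv   = K * K * M * DF * DF + M * N * DF * DF;  // depthwise + pointwise

    std::printf("full conv: %lld multiply-adds\n", full_conv);
    std::printf("ds conv:   %lld multiply-adds\n", ds_conv);
    std::printf("ratio:     %.3f (theory 1/N + 1/K^2 = %.3f)\n",
                (double)ds_conv / (double)full_conv, 1.0 / N + 1.0 / (K * K));
    return 0;
}

For this example layer, the depthwise separable form needs roughly 12% of the multiply-adds of the full convolution, consistent with the MobileNet results quoted above.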
Although DS-CNNs were originally developed for image recognition, these same models can serve audio recognition simply by transforming an audio input stream into a frequency spectrogram to provide a set of usable features. In effect, an audio front end converts the audio stream to a set of features that the DS-CNN can classify. For speech processing, the features produced by the front end typically take the form of Mel-frequency cepstral coefficients (MFCC), which more closely match human auditory characteristics while significantly reducing the dimensionality of the feature set passed to the DS-CNN classifier. This is precisely the approach used in the ARM ML-KWS-for-MCU open-source software repository.
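The MFCC front end itself follows a well-established sequence: frame the audio, apply a window, take the magnitude spectrum, warp it onto a Mel-spaced filterbank, take the logarithm, and decorrelate with a DCT. The outline below sketches that sequence in C++ for a single audio frame. The helper functions are hypothetical placeholders rather than functions from the ARM repository (which provides its own MFCC class, shown later in Listing 3), and the frame and filterbank sizes are typical values, not requirements.

// Structural sketch of MFCC feature extraction for one audio frame. The helper
// functions are hypothetical placeholders; a real design would use an FFT/DSP
// library such as CMSIS-DSP or the MFCC class in the ARM repository.
#include <cstdint>
#include <cstddef>
#include <cmath>

void hamming_window(float* frame, size_t n);                              // placeholder
void fft_magnitude(const float* frame, float* spectrum, size_t n);        // placeholder
void mel_filterbank(const float* spectrum, float* mel, int bins);         // placeholder
void dct(const float* log_mel, float* coeffs, int bins, int num_coeffs);  // placeholder

// One 40 ms frame of 16 kHz audio -> num_coeffs MFCC values.
void mfcc_compute_frame(const int16_t* samples, float* mfcc_out, int num_coeffs) {
    constexpr size_t kFrameLen = 640;   // 40 ms at 16 kHz (typical, not mandated)
    constexpr int    kMelBins  = 40;

    float frame[kFrameLen];
    float spectrum[kFrameLen / 2 + 1];
    float mel[kMelBins];
    float log_mel[kMelBins];

    for (size_t i = 0; i < kFrameLen; ++i)          // convert samples to float
        frame[i] = static_cast<float>(samples[i]);

    hamming_window(frame, kFrameLen);               // taper the frame edges
    fft_magnitude(frame, spectrum, kFrameLen);      // frequency-domain magnitudes
    mel_filterbank(spectrum, mel, kMelBins);        // warp onto the Mel scale
    for (int i = 0; i < kMelBins; ++i)
        log_mel[i] = logf(mel[i] + 1e-6f);          // log compression
    dct(log_mel, mfcc_out, kMelBins, num_coeffs);   // keep the first num_coeffs coefficients
}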
DS-CNN implementation
Designed to demonstrate KWS implementation on Arm Cortex®-M-series MCUs, the ARM KWS repository provides an extensive set of pre-trained TensorFlow models in multiple architectures including conventional CNNs, DS-CNNs, and others. Trained with the Google speech commands dataset4, the models classify audio input as one of 12 possible classes: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "silence" (no word spoken), and "unknown" (representing the other words contained in the Google dataset).
Developers can immediately use these pre-trained models to compare inference performance of these alternative neural network architectures and examine their internal structure. For example, after running TensorFlow's import_pb_to_tensorboard Python utility on ARM's pre-trained DS-CNN model, the developer can use TensorBoard to visualize the model's MobileNet-based architecture.
Figure 3: Displayed in TensorBoard, the Arm pre-trained KWS model combines a familiar MobileNet DS-CNN model (red outline, left) with a frequency domain feature extraction stage (expanded, on right) using Mel-frequency cepstral coefficients (MFCC). (Image source: DigiKey)
As visualized in TensorBoard, the MobileNet architecture replaces all but the first full convolution layer in the conventional CNN architecture with depthwise separable convolutions.
As noted earlier, each of these stages includes a depthwise convolution and a pointwise convolution, each feeding into a batchnorm kernel to normalize the output results (Figure 3, left). The DS-CNN model uses TensorFlow's fused batchnorm function, which combines the batch normalization operations into a single kernel.
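For deployment on the MCU, these batch normalization parameters need not be computed at run time at all. As the comments in Listing 1 note ("batch norm params folded into conv wts/bias"), batch norm is a per-channel affine transform, y = γ·(x − μ)/√(σ² + ε) + β, so its scale and offset can be absorbed into the preceding convolution's weights and bias when the model is exported. The sketch below illustrates the folding arithmetic in C++; it is a conceptual example, not code from the ARM repository or the TensorFlow tools.

// Conceptual sketch: fold batch norm into the preceding convolution for one output channel.
//   w' = w * gamma / sqrt(var + eps)
//   b' = (b - mean) * gamma / sqrt(var + eps) + beta
#include <cmath>
#include <vector>

void fold_batchnorm_channel(std::vector<float>& weights,   // all weights of one output channel
                            float& bias,                   // bias of that channel
                            float gamma, float beta,
                            float mean, float var, float eps = 1e-3f)
{
    const float scale = gamma / std::sqrt(var + eps);
    for (float& w : weights)
        w *= scale;                           // scale every weight of this channel
    bias = (bias - mean) * scale + beta;      // absorb the mean and offset into the bias
}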
In addition, by zooming into the audio input feature extraction stage (Figure 3, right), developers can examine the audio processing sequence including audio decode, spectrogram generation, and MFCC filtering. The features generated by the MFCC pass through a pair of reshape stages to create the tensor shapes required by the MobileNet classifier.
Developers can conceivably run trained models from TensorFlow or other machine learning frameworks on modest hardware platforms such as the Raspberry Pi.5 With this approach, developers can quantize the trained models to produce smaller versions able to run on these systems. However, without a graphics processing unit (GPU) or other hardware support for GEMM acceleration, inference latency would likely disappoint user expectations for voice activation performance.
ARM provides an alternative approach through its neural network (NN) extension to the ARM Cortex Microcontroller Software Interface Standard (CMSIS). CMSIS-NN provides a complete set of CNN functions that take full advantage of the DSP extensions built into ARM Cortex-M7 processors such as those in STMicroelectronics’ STM32F7 MCU family. Along with conventional CNN functions, the CMSIS-NN application programming interface (API) supports depthwise separable convolutions with a pair of functions corresponding to the depthwise and pointwise 1 x 1 convolution stages underlying DS-CNN architectures:
arm_status arm_depthwise_separable_conv_HWC_q7_nonsquare
arm_status arm_convolve_1x1_HWC_q7_fast_nonsquare
The API also provides versions of these two functions designed specifically for square input tensors.
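The q7 suffix in these function names refers to the 8-bit fixed-point format used throughout CMSIS-NN, in which values carry an implied binary point. The bias left-shift and output right-shift arguments that appear in the sample code below (for example CONV1_BIAS_LSHIFT and CONV1_OUT_RSHIFT in Listing 1) align the fixed-point formats of the inputs, weights, biases, and outputs. The snippet below sketches the core accumulate-and-rescale idea for a single output value; it is a conceptual illustration of the arithmetic, not the CMSIS-NN implementation itself.

// Conceptual q7 fixed-point accumulate-and-rescale (illustration only, not CMSIS-NN code).
#include <cstdint>
#include <algorithm>

int8_t q7_dot_product(const int8_t* x, const int8_t* w, int len,
                      int32_t bias, int bias_left_shift, int out_right_shift)
{
    // Accumulate in 32 bits: products of two q7 values plus the bias aligned by a left shift.
    int32_t acc = bias << bias_left_shift;
    for (int i = 0; i < len; ++i)
        acc += static_cast<int32_t>(x[i]) * static_cast<int32_t>(w[i]);

    // Rescale to the output's q7 format and saturate to the int8 range.
    acc >>= out_right_shift;
    acc = std::min<int32_t>(127, std::max<int32_t>(-128, acc));
    return static_cast<int8_t>(acc);
}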
ARM uses these functions in sample code that demonstrates a complete DS-CNN-based KWS application running on the STMicroelectronics STM32F746G-DISCO development board, built around the STM32F746NGH6 MCU. At the heart of the sample code, a native CMSIS-NN C++ module implements the DS-CNN (Listing 1).
void DS_CNN::run_nn(q7_t* in_data, q7_t* out_data)
{
//CONV1 : regular convolution
arm_convolve_HWC_q7_basic_nonsquare(in_data, CONV1_IN_X, CONV1_IN_Y, 1, conv1_wt, CONV1_OUT_CH, CONV1_KX, CONV1_KY, CONV1_PX, CONV1_PY, CONV1_SX, CONV1_SY, conv1_bias, CONV1_BIAS_LSHIFT, CONV1_OUT_RSHIFT, buffer1, CONV1_OUT_X, CONV1_OUT_Y, (q15_t*)col_buffer, NULL);
arm_relu_q7(buffer1,CONV1_OUT_X*CONV1_OUT_Y*CONV1_OUT_CH);
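//CONV2 : DS + PW conv (block omitted from this excerpt; it follows the same depthwise + pointwise pattern as CONV3 below, producing the CONV2_OUT_CH channels that CONV3 consumes)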
//CONV3 : DS + PW conv
//Depthwise separable conv (batch norm params folded into conv wts/bias)
arm_depthwise_separable_conv_HWC_q7_nonsquare(buffer1,CONV3_IN_X,CONV3_IN_Y,CONV2_OUT_CH,conv3_ds_wt,CONV2_OUT_CH,CONV3_DS_KX,CONV3_DS_KY,CONV3_DS_PX,CONV3_DS_PY,CONV3_DS_SX,CONV3_DS_SY,conv3_ds_bias,CONV3_DS_BIAS_LSHIFT,CONV3_DS_OUT_RSHIFT,buffer2,CONV3_OUT_X,CONV3_OUT_Y,(q15_t*)col_buffer, NULL);
arm_relu_q7(buffer2,CONV3_OUT_X*CONV3_OUT_Y*CONV3_OUT_CH);
//Pointwise conv
arm_convolve_1x1_HWC_q7_fast_nonsquare(buffer2, CONV3_OUT_X, CONV3_OUT_Y, CONV2_OUT_CH, conv3_pw_wt, CONV3_OUT_CH, 1, 1, 0, 0, 1, 1, conv3_pw_bias, CONV3_PW_BIAS_LSHIFT, CONV3_PW_OUT_RSHIFT, buffer1, CONV3_OUT_X, CONV3_OUT_Y, (q15_t*)col_buffer, NULL);
arm_relu_q7(buffer1,CONV3_OUT_X*CONV3_OUT_Y*CONV3_OUT_CH);
//CONV4 : DS + PW conv
//Depthwise separable conv (batch norm params folded into conv wts/bias)
arm_depthwise_separable_conv_HWC_q7_nonsquare(buffer1,CONV4_IN_X,CONV4_IN_Y,CONV3_OUT_CH,conv4_ds_wt,CONV3_OUT_CH,CONV4_DS_KX,CONV4_DS_KY,CONV4_DS_PX,CONV4_DS_PY,CONV4_DS_SX,CONV4_DS_SY,conv4_ds_bias,CONV4_DS_BIAS_LSHIFT,CONV4_DS_OUT_RSHIFT,buffer2,CONV4_OUT_X,CONV4_OUT_Y,(q15_t*)col_buffer, NULL);
arm_relu_q7(buffer2,CONV4_OUT_X*CONV4_OUT_Y*CONV4_OUT_CH);
//Pointwise conv
arm_convolve_1x1_HWC_q7_fast_nonsquare(buffer2, CONV4_OUT_X, CONV4_OUT_Y, CONV3_OUT_CH, conv4_pw_wt, CONV4_OUT_CH, 1, 1, 0, 0, 1, 1, conv4_pw_bias, CONV4_PW_BIAS_LSHIFT, CONV4_PW_OUT_RSHIFT, buffer1, CONV4_OUT_X, CONV4_OUT_Y, (q15_t*)col_buffer, NULL);
arm_relu_q7(buffer1,CONV4_OUT_X*CONV4_OUT_Y*CONV4_OUT_CH);
//CONV5 : DS + PW conv
//Depthwise separable conv (batch norm params folded into conv wts/bias)
arm_depthwise_separable_conv_HWC_q7_nonsquare(buffer1,CONV5_IN_X,CONV5_IN_Y,CONV4_OUT_CH,conv5_ds_wt,CONV4_OUT_CH,CONV5_DS_KX,CONV5_DS_KY,CONV5_DS_PX,CONV5_DS_PY,CONV5_DS_SX,CONV5_DS_SY,conv5_ds_bias,CONV5_DS_BIAS_LSHIFT,CONV5_DS_OUT_RSHIFT,buffer2,CONV5_OUT_X,CONV5_OUT_Y,(q15_t*)col_buffer, NULL);
arm_relu_q7(buffer2,CONV5_OUT_X*CONV5_OUT_Y*CONV5_OUT_CH);
//Pointwise conv
arm_convolve_1x1_HWC_q7_fast_nonsquare(buffer2, CONV5_OUT_X, CONV5_OUT_Y, CONV4_OUT_CH, conv5_pw_wt, CONV5_OUT_CH, 1, 1, 0, 0, 1, 1, conv5_pw_bias, CONV5_PW_BIAS_LSHIFT, CONV5_PW_OUT_RSHIFT, buffer1, CONV5_OUT_X, CONV5_OUT_Y, (q15_t*)col_buffer, NULL);
arm_relu_q7(buffer1,CONV5_OUT_X*CONV5_OUT_Y*CONV5_OUT_CH);
//Average pool
arm_avepool_q7_HWC_nonsquare(buffer1,CONV5_OUT_X,CONV5_OUT_Y,CONV5_OUT_CH,CONV5_OUT_X,CONV5_OUT_Y,0,0,1,1,1,1,NULL,buffer2, 2);
arm_fully_connected_q7(buffer2, final_fc_wt, CONV5_OUT_CH, OUT_DIM, FINAL_FC_BIAS_LSHIFT, FINAL_FC_OUT_RSHIFT, final_fc_bias, out_data, (q15_t*)col_buffer);
}
Listing 1: The ARM ML-KWS-for-MCU software repository includes a C++ DS-CNN model in which a full convolution layer is followed by several depthwise separable convolution stages, each implemented with the depthwise convolution and 1 x 1 convolution functions supported in the hardware-optimized ARM CMSIS-NN software library. (Code source: ARM)
Although the C++ DS-CNN implementation differs slightly from the TensorBoard DS-CNN model shown earlier, the overall approach remains the same. Following an initial full convolution kernel, a series of depthwise separable convolution kernels feed into final pooling and fully connected layers to generate the prediction values for each output channel (corresponding to the 12 class labels used to train the model).
The KWS application combines this model with code to provide inference of real-time audio streams collected by the STM32F746G-DISCO development board. Here, the main function initializes the inference engine, enables audio sampling, and then enters an endless loop consisting of a single wait-for-interrupt (WFI) call (Listing 2).
char output_class[12][8] = {"Silence", "Unknown","yes","no","up","down",
"left","right","on","off","stop","go"};
.
.
.
int main()
{
pc.baud(9600);
kws = new KWS_F746NG(recording_win,averaging_window_len);
init_plot();
kws->start_kws();
T.start();
while (1) {
/* A dummy loop to wait for the interrupts. Feature extraction and
neural network inference are done in the interrupt service routine. */
__WFI();
}
}
/*
 * The audio recording works with two ping-pong buffers.
 * The data for each window is transferred by the DMA, which sends
 * an interrupt after the transfer is completed.
 */
// Manages the DMA Transfer complete interrupt.
void BSP_AUDIO_IN_TransferComplete_CallBack(void)
{
arm_copy_q7((q7_t *)kws->audio_buffer_in + kws->audio_block_size*4, (q7_t *)kws->audio_buffer_out + kws->audio_block_size*4, kws->audio_block_size*4);
if(kws->frame_len != kws->frame_shift) {
//copy the last (frame_len - frame_shift) audio data to the start
arm_copy_q7((q7_t *)(kws->audio_buffer)+2*(kws->audio_buffer_size-(kws->frame_len-kws->frame_shift)), (q7_t *)kws->audio_buffer, 2*(kws->frame_len-kws->frame_shift));
}
// copy the new recording data
for (int i=0;i<kws->audio_block_size;i++) {
kws->audio_buffer[kws->frame_len-kws->frame_shift+i] = kws->audio_buffer_in[2*kws->audio_block_size+i*2];
}
run_kws();
return;
}
// Manages the DMA Half Transfer complete interrupt.
void BSP_AUDIO_IN_HalfTransfer_CallBack(void)
{
arm_copy_q7((q7_t *)kws->audio_buffer_in, (q7_t *)kws->audio_buffer_out, kws->audio_block_size*4);
if(kws->frame_len!=kws->frame_shift) {
//copy the last (frame_len - frame_shift) audio data to the start
arm_copy_q7((q7_t *)(kws->audio_buffer)+2*(kws->audio_buffer_size-(kws->frame_len-kws->frame_shift)), (q7_t *)kws->audio_buffer, 2*(kws->frame_len-kws->frame_shift));
}
// copy the new recording data
for (int i=0;i<kws->audio_block_size;i++) {
kws->audio_buffer[kws->frame_len-kws->frame_shift+i] = kws->audio_buffer_in[i*2];
}
run_kws();
return;
}
void run_kws()
{
kws->extract_features(); //extract mfcc features
kws->classify(); //classify using dnn
kws->average_predictions();
plot_mfcc();
plot_waveform();
int max_ind = kws->get_top_class(kws->averaged_output);
if(kws->averaged_output[max_ind]>detection_threshold*128/100)
sprintf(lcd_output_string,"%d%% %s",((int)kws->averaged_output[max_ind]*100/128),output_class[max_ind]);
lcd.ClearStringLine(8);
lcd.DisplayStringAt(0, LINE(8), (uint8_t *) lcd_output_string, CENTER_MODE);
}
Listing 2: In the ARM ML-KWS-for-MCU software repository, the main routine for the DS-CNN KWS application instantiates the inference engine (through KWS_F746NG), activates the STM32F746G-DISCO development board's audio subsystem, and enters an endless loop, waiting for interrupts to call completion routines that perform inference (run_kws()). (Code source: ARM)
Defined alongside this main routine, callback functions provide completion routines that buffer the recorded data and begin the inference process with a call to run_kws(). The run_kws() function invokes calls on the inference engine instance to extract features, classify the result, and provide predictions that indicate the probability that the recorded audio sample belongs to one of the 12 classes used in the original training, as described previously.
The inference engine itself is instantiated through a series of calls, starting with the call in main() that instantiates the KWS_F746NG class, itself a subclass of the KWS_DS_CNN class. This latter class encapsulates the C++ DS-CNN model shown earlier and derives from the parent class KWS, which implements the core inference engine methods: extract_features(), classify(), and more (Listing 3).
#include "kws.h"
KWS::KWS()
{
}
KWS::~KWS()
{
delete mfcc;
delete mfcc_buffer;
delete output;
delete predictions;
delete averaged_output;
}
void KWS::init_kws()
{
num_mfcc_features = nn->get_num_mfcc_features();
num_frames = nn->get_num_frames();
frame_len = nn->get_frame_len();
frame_shift = nn->get_frame_shift();
int mfcc_dec_bits = nn->get_in_dec_bits();
num_out_classes = nn->get_num_out_classes();
mfcc = new MFCC(num_mfcc_features, frame_len, mfcc_dec_bits);
mfcc_buffer = new q7_t[num_frames*num_mfcc_features];
output = new q7_t[num_out_classes];
averaged_output = new q7_t[num_out_classes];
predictions = new q7_t[sliding_window_len*num_out_classes];
audio_block_size = recording_win*frame_shift;
audio_buffer_size = audio_block_size + frame_len - frame_shift;
}
void KWS::extract_features()
{
if(num_frames>recording_win) {
//move old features left
memmove(mfcc_buffer,mfcc_buffer+(recording_win*num_mfcc_features),(num_frames-recording_win)*num_mfcc_features);
}
//compute features only for the newly recorded audio
int32_t mfcc_buffer_head = (num_frames-recording_win)*num_mfcc_features;
for (uint16_t f = 0; f < recording_win; f++) {
mfcc->mfcc_compute(audio_buffer+(f*frame_shift),&mfcc_buffer[mfcc_buffer_head]);
mfcc_buffer_head += num_mfcc_features;
}
}
void KWS::classify()
{
nn->run_nn(mfcc_buffer, output);
// Softmax
arm_softmax_q7(output,num_out_classes,output);
}
int KWS::get_top_class(q7_t* prediction)
{
int max_ind=0;
int max_val=-128;
for(int i=0;i<num_out_classes;i++) {
if(max_val<prediction[i]) {
max_val = prediction[i];
max_ind = i;
}
}
return max_ind;
}
void KWS::average_predictions()
{
//shift right old predictions
arm_copy_q7((q7_t *)predictions, (q7_t *)(predictions+num_out_classes), (sliding_window_len-1)*num_out_classes);
//add new predictions
arm_copy_q7((q7_t *)output, (q7_t *)predictions, num_out_classes);
//compute averages
int sum;
for(int j=0;j<num_out_classes;j++) {
sum=0;
for(int i=0;i<sliding_window_len;i++)
sum += predictions[i*num_out_classes+j];
averaged_output[j] = (q7_t)(sum/sliding_window_len);
}
}
Listing 3: In the ARM DS-CNN KWS application, the KWS class provides the methods needed to perform inference operations, including feature extraction, classification, and generation of results smoothed by an averaging window. (Code source: ARM)
All of this software complexity is hidden behind a simple use model: the main routine starts the process by instantiating the inference engine, and the interrupt completion routines perform inference as audio input becomes available. According to ARM, this sample CMSIS-NN implementation running on the STM32F746G-DISCO development board needs only about 12 milliseconds (ms) to complete an inference cycle, which includes audio data buffer copying, feature extraction, and DS-CNN model execution. Just as important, the complete KWS application requires only about 70 Kbytes of memory.
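Developers who want to confirm such timing figures on their own hardware can instrument the sample directly. The helper below is hypothetical, not code from the ARM repository: it assumes the Mbed build environment used by the sample and the Timer T that the sample's main routine already starts (Listing 2), and simply brackets one pipeline step with microsecond timestamps.

// Hypothetical timing helper (not ARM repository code). Assumes the Mbed environment
// used by the sample and the Timer T started in main() (Listing 2).
#include "mbed.h"

extern Timer T;   // declared and started in the sample's main module (assumption)

template <typename Fn>
int time_us(Fn step)
{
    int start = T.read_us();   // microseconds since T.start()
    step();                    // run one pipeline step
    return T.read_us() - start;
}

A call such as int us = time_us([]{ kws->classify(); }); would then report just the DS-CNN execution portion of the roughly 12 ms cycle, assuming the global kws pointer from Listing 2 is visible at the call site.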
Conclusion
As KWS capability becomes an increasingly important requirement, developers of resource-limited wearables and other IoT designs need small-footprint inference engines. Built to leverage the DSP features in ARM Cortex-M7 MCUs, the ARM CMSIS-NN library provides the foundation for implementing optimized neural network architectures, such as DS-CNNs, that can meet these requirements.
Running on an ARM Cortex-M7 MCU-based development system, a KWS inference engine can achieve performance approaching 10 inferences/s in a memory footprint easily supported by resource limited IoT devices.
References
1. Add High-Performance Speech Keyword Spotting to IoT Designs: Part 1 – Using FPGAs, DigiKey
2. Get Started with Machine Learning Using Readily Available Hardware and Software, DigiKey
3. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
4. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
5. Build a Machine Learning Application with a Raspberry Pi, DigiKey