
How to Run Custom Speech-to-Text (STT) and Text-to-Speech (TTS) Servers

2022-05-16 | By ShawnHymel

License: Attribution Raspberry Pi

 

In a previous tutorial, we created a custom wake word for Mycroft AI. This time, we demonstrate how to set up a speech-to-text (STT) and text-to-speech (TTS) server on a computer that allows Mycroft to perform speech analysis and voice synthesis on a local network.

You can see a walkthrough of this procedure in video form.

Concept

Out of the box, Mycroft AI must communicate with various services across the internet.

Mycroft STT and TTS services

By default, Mycroft AI performs wake word detection (which we covered previously) on the device (e.g. Raspberry Pi). Once the wake word is heard, it streams audio to a speech-to-text (STT) service that returns a string of what was uttered.

Mycroft (on device) performs intent analysis, matching the returned string to a particular skill. It then communicates with the Mycroft backend to validate the skill and perform any necessary Internet-based actions (e.g. fetch current weather conditions). This information is sent back to the Mycroft service on the device.

Finally, Mycroft generates an audio response (which we covered in the first tutorial). To create an audio output, Mycroft uses a text-to-speech (TTS) service. By default, this is the Mimic 1 TTS engine that runs on the Raspberry Pi. It’s a little robotic, but it’s fast and gets the job done. You can read more about the supported TTS engines here: https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customizations/tts-engine.

If you want to run Mycroft completely locally (within your own local network), you need to host STT, TTS, and the Mycroft backend services.

Mycroft is open source, which means you can run the backend (known as “Selene”) from the following project: https://github.com/MycroftAI/selene-backend. However, it’s configured to set up multiple skills with a skill store for many devices. As such, it’s overkill and complicated for our setup, so we will ignore it for now.

In this tutorial, we will focus on running custom TTS and STT servers on a laptop. Mycroft AI (on the Pi) will communicate with these servers to perform text-to-speech and speech-to-text. It’s not a completely offline Mycroft solution, but it gets us one step closer.

Mycroft with local TTS and STT services

Required Hardware

You will need a Raspberry Pi 3B+ or 4B for this tutorial. You will also need a computer capable of running Ubuntu 18.04 (preferably something with a CUDA-enabled graphics card if you want to speed up computation times).

CUDA Installation

If you’d like to use CUDA, check the compatibility chart here to make sure that your graphics card is capable of running CUDA 10.1 (we specifically want this version for our TTS and STT functions). If you do not want to use CUDA (or your card can’t run it), just skip this section.

I tested this on Ubuntu 18.04 LTS. Other versions of Ubuntu may or may not work. Ubuntu 18.04 came with Python 3.6.9, which also worked. My laptop has an Nvidia GTX 1060 Mobile graphics card, which worked for this guide.

See here for a guide on installing Ubuntu. You will want to install this on a computer that will act as a server for your TTS and STT services. When you’re installing Ubuntu, do NOT install third-party apps! Doing so will likely install incompatible versions of the Nvidia drivers.

When you first boot into Ubuntu, update your software, but do not upgrade to Ubuntu 20.04. If you’re using a laptop, as I am, you’ll want to disable sleep on lid close. I also installed an SSH server so I could log into the laptop via multiple terminals.
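
For reference, here is one way to handle those setup steps from a terminal. This is a sketch assuming a stock Ubuntu 18.04 install with systemd; adjust it to your system:

Copy Code
# Update package lists and installed software (do not run a release upgrade)
sudo apt update && sudo apt upgrade
# Install an SSH server for remote terminal access
sudo apt install openssh-server
# Keep a laptop awake when the lid is closed (assumes the default commented
# HandleLidSwitch line is present in /etc/systemd/logind.conf)
sudo sed -i 's/#HandleLidSwitch=suspend/HandleLidSwitch=ignore/' /etc/systemd/logind.conf
sudo systemctl restart systemd-logind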

To start the CUDA installation process, check your graphics card version (and make sure it has CUDA support in the chart linked above):

Copy Code
lspci | grep -i nvidia

Make sure you have gcc:

Copy Code
gcc --version

Next, install Linux headers:

Copy Code
sudo apt-get install linux-headers-$(uname -r)

Add the Nvidia package repositories:

Copy Code
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update

Then, install the Nvidia driver. Fighting with Nvidia drivers on Linux is the stuff of legends and nightmares; this one worked for my graphics card and CUDA 10.1.

Copy Code
sudo apt-get install -y --no-install-recommends nvidia-driver-430
sudo reboot

Check your driver:

Copy Code
nvidia-smi

Next, install the CUDA libraries (you can read about the CUDA 10.1 installation process here):

Copy Code
sudo apt-get install -y --no-install-recommends cuda-10-1 libcudnn7=7.6.4.38-1+cuda10.1 libcudnn7-dev=7.6.4.38-1+cuda10.1

Install cuDNN. Note that you will first need to download the cudnn-10.1-linux-x64-v7.6.5.32.tgz archive from the Nvidia Developer site, which requires a free account (you can read about the cuDNN installation process here):

Copy Code
tar -xvf cudnn-10.1-linux-x64-v7.6.5.32.tgz
sudo mkdir -p /usr/local/cuda/include 
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo mkdir -p /usr/local/cuda/lib64
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64 
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
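
As a quick sanity check, you can print the version macros from the header you just copied; with this cuDNN release, they should report version 7.6.5:

Copy Code
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h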

Finally, update your path and environment variables:

Copy Code
nano ~/.bashrc

Copy in the following at the bottom of that document:

Copy Code
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda
export PATH="$PATH:/usr/local/cuda/bin"

Save and exit. Source the file to set the variables in the current session:

Copy Code
source ~/.bashrc

When you’re done, check the current CUDA install. It should be v10.1.

Copy Code
nvcc --version
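
If everything installed correctly, the last line of that output should look something like the following (the exact build number may differ):

Copy Code
Cuda compilation tools, release 10.1, V10.1.243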

Install Speech-to-Text Server

DeepSpeech is an open source speech-to-text project maintained by Mozilla. We will use it as our STT service.

If you don’t already have pip and virtualenv, you will need to install them:

Copy Code
sudo apt install python3-pip
sudo pip3 install virtualenv

Next, create a directory and virtual environment for DeepSpeech:

Copy Code
mkdir -p ~/Projects/deepspeech
cd ~/Projects/deepspeech
virtualenv -p python3 deepspeech-venv
source deepspeech-venv/bin/activate

Install DeepSpeech and DeepSpeech server. Note: if you are not using CUDA, then install “deepspeech” instead of “deepspeech-gpu.”

Copy Code
python3 -m pip install deepspeech-gpu deepspeech-server

Download the DeepSpeech model files:

Copy Code
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

Download a sample audio file and test DeepSpeech:

Copy Code
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
tar xvf audio-0.9.3.tar.gz
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav

You should see “experience proves this” printed in the output. If you listen to 2830-3980-0043.wav, you’ll hear someone saying those words.

Let’s create a configuration file for the DeepSpeech server:

Copy Code
nano config.json

In that file, copy in the following (change <username> to your username):

Copy Code
{
  "deepspeech": {
    "model" :"/home/<username>/Projects/deepspeech/deepspeech-0.9.3-models.pbmm",
    "scorer" :/home/<username>Projects/deepspeech/deepspeech-0.9.3-models.scorer",
    "beam_width": 500,
    "lm_alpha": 0.931289039105002,
    "lm_beta": 1.1834137581510284
  },
  "server": {
    "http": {
      "host": "0.0.0.0",
      "port": 5008,
      "request_max_size": 1048576
    }
  },
  "log": {
    "level": [
      { "logger": "deepspeech_server", "level": "DEBUG"}
    ]
  }
}

Save and exit. You can run the server with the following:

Copy Code
deepspeech-server --config config.json

Try sending the test audio file to the server:

Copy Code
curl -X POST --data-binary @audio/2830-3980-0043.wav http://0.0.0.0:5008/stt
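
If you would like to test with your own voice, you can record a short clip and send it the same way. This is a sketch assuming a working ALSA microphone (arecord); note that DeepSpeech expects 16 kHz, 16-bit, mono audio:

Copy Code
# Record 5 seconds of 16 kHz, 16-bit, mono audio from the default microphone
arecord -f S16_LE -r 16000 -c 1 -d 5 test.wav
# Send the recording to the STT server and print the transcript
curl -X POST --data-binary @test.wav http://0.0.0.0:5008/stt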

If that works, you can create a unit file to run the server on boot with systemd:

Copy Code
sudo nano /etc/systemd/system/stt-server.service

In that file, enter the following (once again, change <username> to your username):

Copy Code
[Unit]
Description=Mozilla DeepSpeech running as a server

[Service]
ExecStart=/home/<username>/Projects/deepspeech/deepspeech-venv/bin/deepspeech-server \
    --config /home/<username>/Projects/deepspeech/config.json

[Install]
WantedBy=multi-user.target

Save and exit. Load and enable the server (to run on boot), and then run it:

Copy Code
sudo systemctl daemon-reload
sudo systemctl enable stt-server.service
sudo systemctl start stt-server.service

You can check the status with:

Copy Code
systemctl status stt-server.service

You can stop and disable the service with the following:

Copy Code
sudo systemctl stop stt-server.service
sudo systemctl disable stt-server.service

You can watch the output of the service with:

Copy Code
journalctl --follow -u stt-server.service

Your STT server should now be running on your computer!

Install Text-to-Speech Server

We will be using the Coqui TTS server, which is a fork of the Mozilla TTS project with a server wrapper.

As with STT, we want to create a separate virtual environment:

Copy Code
mkdir -p ~/Projects/tts
cd ~/Projects/tts
virtualenv -p python3 tts-venv
source tts-venv/bin/activate

Install the TTS package:

Copy Code
python3 -m pip install tts

You can test TTS with (set use_cuda to false if you do not have CUDA installed):

Copy Code
python -m TTS.server.server --use_cuda true

The first run downloads and uses the default models. Press Ctrl+C to stop the server. You can list the available models with the following:

Copy Code
tts --list_models

You can learn more about the TTS and vocoder models here.
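
Before moving on to the server, you can also sanity-check a model by synthesizing a WAV file directly from the command line. This is a sketch using the same model we will pass to the server below (the first run downloads the model):

Copy Code
tts --text "This is a test of Coqui TTS." \
    --model_name tts_models/en/ljspeech/tacotron2-DDC \
    --out_path test_tts.wav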

If you would like to use a particular model and listen on a particular port, you can do so with the following command:

Copy Code
tts-server \
--use_cuda true \
--port 5009 \
--model_name tts_models/en/ljspeech/tacotron2-DDC

Open a browser and point it to your server’s address. For example, navigate to http://localhost:5009 on the server itself (or use the server’s IP address from another machine on the network). You should be greeted with the Coqui page. Feel free to type a string into the input bar and press “Speak.” If your speakers are working, you should hear your string spoken back to you.

Coqui TTS web interface
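
You can also fetch synthesized audio without the web interface. The sketch below assumes the server exposes the /api/tts endpoint (the same endpoint Mycroft’s “mozilla” TTS module calls) and that aplay is available for playback:

Copy Code
# Request synthesized speech as a WAV file
curl -G --output speech.wav \
    --data-urlencode "text=Hello from the TTS server" \
    http://localhost:5009/api/tts
# Play it back
aplay speech.wav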

Next, let’s create a unit file for systemd for our TTS server:

Copy Code
sudo nano /etc/systemd/system/tts-server.service

Copy in the following (change <username> to your username):

Copy Code
[Unit]
Description=Coqui TTS running as a server

[Service]
ExecStart=/home/<username>/Projects/tts/tts-venv/bin/tts-server \
    --use_cuda true \
    --port 5009 \
    --model_name tts_models/en/ljspeech/tacotron2-DDC

[Install]
WantedBy=multi-user.target

Save and exit. Load and enable the service. Then, run it:

Copy Code
sudo systemctl daemon-reload
sudo systemctl enable tts-server.service
sudo systemctl start tts-server.service

Check the status with:

Copy Code
systemctl status tts-server.service

You can stop and disable the service with:

Copy Code
sudo systemctl stop tts-server.service
sudo systemctl disable tts-server.service

Watch the output with:

Copy Code
journalctl --follow -u tts-server.service

Your TTS server should now be running on your computer!

Configure Mycroft

Make sure that your STT and TTS servers are running on your computer. Also, make sure that your Mycroft device (e.g. your Raspberry Pi) is connected to the same network as the computer running your STT and TTS services.

Log into Mycroft and exit the CLI client (Ctrl+C). Enter the following:

Copy Code
mycroft-config edit user

Add the “stt” and “tts” sections to have Mycroft point to our new STT and TTS servers. Here is an example of a full config file. Note that I have “hey jorvon” as a custom-trained wake word model from a previous tutorial, so your “wake_word” and “hotwords” sections might be different. Also, change the IP address (192.168.1.209 in my case) to the address of the computer running your STT and TTS servers.

Copy Code
{ "max_allowed_core_version": 21.2, "listener": {
"wake_word": "hey jorvon"
},
"hotwords": {
"hey jorvon": {
"module": "precise",
"local_model_file": "/home/pi/.local/share/mycroft/precise/hey-jorvon.pb",
“threshold
"sensitivity": 0.5,
"trigger_level": 3
}
},
"stt": {
"deepspeech_server": {
"uri": "http://192.168.1.209:5008/stt"
},
"module": "deepspeech_server"
},
"tts": {
"mozilla": {
"url": "http://192.168.1.209:5009"
},
"module": "mozilla"
}
}

Save and exit. Restart the Mycroft services:

Copy Code
mycroft-start all restart

Start the Mycroft CLI client:

Copy Code
mycroft-cli-client

With any luck, Mycroft should be using your new TTS and STT servers!

Mycroft with custom STT and TTS backend services

If you need debugging output, Mycroft logs can be found in /var/log/mycroft/.
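
If Mycroft can’t reach the servers, you can also verify connectivity from the Pi’s terminal. A quick check (replace 192.168.1.209 with your server’s address, and use any 16 kHz, 16-bit, mono WAV file for test.wav):

Copy Code
# STT: should return a transcript of test.wav
curl -X POST --data-binary @test.wav http://192.168.1.209:5008/stt
# TTS: should save synthesized speech to out.wav (assumes the /api/tts endpoint)
curl -G --output out.wav --data-urlencode "text=test" http://192.168.1.209:5009/api/tts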

Going Further

See the guides linked throughout this tutorial if you need additional information about STT, TTS, or Mycroft.
