How to Run Custom Speech-to-Text (STT) and Text-to-Speech (TTS) Servers
2022-05-16 | By ShawnHymel
License: Attribution
In a previous tutorial, we created a custom wake word for Mycroft AI. This time, we demonstrate how to set up a speech-to-text (STT) and text-to-speech (TTS) server on a computer that allows Mycroft to perform speech analysis and voice synthesis on a local network.
You can see a walkthrough of this procedure in video form.
Concept
Out of the box, Mycroft AI must communicate with various services across the internet.
By default, Mycroft AI performs wake word detection (which we covered previously) on the device (e.g. Raspberry Pi). Once the wake word is heard, it streams audio to a speech-to-text (STT) service that returns a string of what was uttered.
Mycroft (on device) performs intent analysis, matching the returned string to a particular skill. It then communicates with the Mycroft backend to validate the skill and perform any necessary Internet-based actions (e.g. fetch current weather conditions). This information is sent back to the Mycroft service on the device.
Finally, Mycroft generates an audio response (which we covered in the first tutorial). To create an audio output, Mycroft uses a text-to-speech (TTS) service. By default, this is the Mimic 1 TTS engine that runs on the Raspberry Pi. It’s a little robotic, but it’s fast and gets the job done. You can read more about the supported TTS engines here: https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customizations/tts-engine.
If you want to run Mycroft completely locally (within your own local network), you need to host STT, TTS, and the Mycroft backend services.
Mycroft is open source, which means you can run the backend (known as “Selene”) from the following project: https://github.com/MycroftAI/selene-backend. However, it is designed to manage a skill store and many devices, so it is overkill for our setup, and we will ignore it for now.
In this tutorial, we will focus on running custom TTS and STT servers on a laptop. Mycroft AI (on the Pi) will communicate with these servers to perform text-to-speech and speech-to-text. It’s not a completely offline Mycroft solution, but it gets us one step closer.
Required Hardware
You will need a Raspberry Pi 3B+ or 4B for this tutorial. You will also need a computer capable of running Ubuntu 18.04 (preferably something with a CUDA-enabled graphics card if you want to speed up computation times).
CUDA Installation
If you’d like to use CUDA, check the compatibility chart here to make sure that your graphics card is capable of running CUDA 10.1 (we specifically want this version for our TTS and STT functions). If you do not want (or can’t run) CUDA, just skip this section.
I tested this on Ubuntu 18.04 LTS. Other versions of Ubuntu may or may not work. Ubuntu 18.04 came with Python 3.6.9, which also worked. My laptop has an Nvidia GTX 1060 Mobile graphics card, which worked for this guide.
See here for a guide on installing Ubuntu. You will want to install this on a computer that will act as a server for your TTS and STT services. When you’re installing Ubuntu, do NOT install third-party apps! This will likely install non-compatible versions of the Nvidia drivers.
When you first boot into Ubuntu, update your software, but do not upgrade to Ubuntu 20.04. If you’re using a laptop as I am, you’ll want to disable sleep on lid close. I also installed an SSH server so I could log into the laptop from multiple terminals.
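For reference, here is one way to do both. The sed edit assumes the stock commented-out default in /etc/systemd/logind.conf; double-check that file before and after editing:
# Install an SSH server for remote logins
sudo apt install openssh-server
# Ignore the lid switch so the laptop keeps running when closed
sudo sed -i 's/#HandleLidSwitch=suspend/HandleLidSwitch=ignore/' /etc/systemd/logind.conf
sudo systemctl restart systemd-logind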
To start the CUDA installation process, check your graphics card version (and make sure it has CUDA support in the chart linked above):
lspci | grep -i nvidia
Make sure you have gcc:
gcc --version
Next, install Linux headers:
sudo apt-get install linux-headers-$(uname -r)
Add the Nvidia package repositories:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update
Then, install the Nvidia driver. Fighting with Nvidia drivers on Linux is the stuff of legends and nightmares; this particular version worked for my graphics card and CUDA 10.1.
sudo apt-get install -y --no-install-recommends nvidia-driver-430
sudo reboot
Check your driver:
nvidia-smi
Next, install the CUDA libraries (you can read about the CUDA 10.1 installation process here):
sudo apt-get install -y --no-install-recommends cuda-10-1 libcudnn7=7.6.4.38-1+cuda10.1 libcudnn7-dev=7.6.4.38-1+cuda10.1
Install cuDNN (you can read about the cuDNN installation process here). Download the cuDNN v7.6.5 archive for CUDA 10.1 from the Nvidia developer site (a free developer account is required), then extract it and copy the files into place:
tar -xvf cudnn-10.1-linux-x64-v7.6.5.32.tgz
sudo mkdir -p /usr/local/cuda/include
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo mkdir -p /usr/local/cuda/lib64
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
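As a quick sanity check, you can grep the copied header for the version numbers; for this archive, it should report major version 7, minor version 6, patch level 5:
grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn.h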
Finally, update your path and environment variables:
nano ~/.bashrc
Copy in the following at the bottom of that document:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda
export PATH="$PATH:/usr/local/cuda/bin"
Save and exit. Source the file to set the variables in the current session:
source ~/.bashrc
When you’re done, check the current CUDA install. It should be v10.1.
nvcc --version
Install Speech-to-Text Server
DeepSpeech is an open source speech-to-text project maintained by Mozilla. We will use it as our STT service.
If you don’t already have pip and virtualenv, you will need to install them:
sudo apt install python3-pip
sudo pip3 install virtualenv
Next, create a directory and virtual environment for DeepSpeech:
mkdir -p ~/Projects/deepspeech
cd ~/Projects/deepspeech
virtualenv -p python3 deepspeech-venv
source deepspeech-venv/bin/activate
Install DeepSpeech and DeepSpeech server. Note: if you are not using CUDA, then install “deepspeech” instead of “deepspeech-gpu.”
python3 -m pip install deepspeech-gpu deepspeech-server
Download the DeepSpeech model files:
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
Download a sample audio file and test DeepSpeech:
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
tar xvf audio-0.9.3.tar.gz
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
You should see “experience proves this” printed in the output. If you listen to 2830-3980-0043.wav, you’ll hear someone saying those words.
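If you have a microphone attached, you can record your own clip and transcribe it as well. DeepSpeech expects 16-bit, 16 kHz, mono WAV audio; the arecord command below (from the alsa-utils package, preinstalled on most Ubuntu systems) produces exactly that:
# Record 5 seconds of 16-bit, 16 kHz, mono audio from the default microphone
arecord -f S16_LE -r 16000 -c 1 -d 5 my-test.wav
# Transcribe the recording
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio my-test.wav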
Let’s create a configuration file for the DeepSpeech server:
nano config.json
In that file, copy in the following (change <username> to your username):
{
  "deepspeech": {
    "model": "/home/<username>/Projects/deepspeech/deepspeech-0.9.3-models.pbmm",
    "scorer": "/home/<username>/Projects/deepspeech/deepspeech-0.9.3-models.scorer",
    "beam_width": 500,
    "lm_alpha": 0.931289039105002,
    "lm_beta": 1.1834137581510284
  },
  "server": {
    "http": {
      "host": "0.0.0.0",
      "port": 5008,
      "request_max_size": 1048576
    }
  },
  "log": {
    "level": [
      { "logger": "deepspeech_server", "level": "DEBUG" }
    ]
  }
}
Save and exit. You can run the server with the following:
deepspeech-server --config config.json
Try sending the test audio file to the server:
curl -X POST --data-binary @audio/2830-3980-0043.wav http://localhost:5008/stt
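You can also send your own recording (e.g., the my-test.wav clip from the earlier arecord step):
curl -X POST --data-binary @my-test.wav http://localhost:5008/stt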
If that works, you can create a unit file to run the server on boot with systemd:
sudo nano /etc/systemd/system/stt-server.service
In that file, enter the following (once again, change <username> to your username):
[Unit]
Description=Mozilla DeepSpeech running as a server
[Service]
ExecStart=/home/<username>/Projects/deepspeech/deepspeech-venv/bin/deepspeech-server \
--config /home/<username>/Projects/deepspeech/config.json
[Install]
WantedBy=multi-user.target
Save and exit. Load and enable the server (to run on boot), and then run it:
sudo systemctl daemon-reload
sudo systemctl enable stt-server.service
sudo systemctl start stt-server.service
You can check the status with:
systemctl status stt-server.service
You can stop and disable the service with the following:
sudo systemctl stop stt-server.service
sudo systemctl disable stt-server.service
You can watch the output of the service with:
journalctl --follow -u stt-server.service
Your STT server should now be running on your computer!
Install Text-to-Speech Server
We will be using the Coqui TTS server, which is a fork of the Mozilla TTS project with a server wrapper.
As with STT, we want to create a separate virtual environment:
mkdir -p ~/Projects/tts
cd ~/Projects/tts
virtualenv -p python3 tts-venv
source tts-venv/bin/activate
Install the TTS package:
python3 -m pip install tts
You can test TTS with (set use_cuda to false if you do not have CUDA installed):
python -m TTS.server.server --use_cuda true
The server downloads and uses the default models on first run; press Ctrl+C to stop it. You can list the available models with the following:
tts --list_models
You can learn more about the TTS and vocoder models here.
If you would like to use a particular model and listen on a particular port, you can do so with the following command:
tts-server \
  --use_cuda true \
  --port 5009 \
  --model_name tts_models/en/ljspeech/tacotron2-DDC
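You can also pair a particular TTS model with a particular vocoder using the --vocoder_name flag. The model and vocoder names below are examples taken from the tts --list_models output; exact availability depends on your installed TTS version:
tts-server \
  --use_cuda true \
  --port 5009 \
  --model_name tts_models/en/ljspeech/glow-tts \
  --vocoder_name vocoder_models/en/ljspeech/multiband-melgan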
Open a browser and point it at your server’s IP address and port (for example, http://localhost:5009 on the server itself). You should be greeted with the Coqui page. Type a string into the input bar and press “Speak.” If your speakers are working, you should hear your string spoken back to you.
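You can also fetch synthesized audio from the command line. The server exposes an /api/tts endpoint that accepts a text query parameter and returns a WAV file (the endpoint name is taken from the Coqui TTS server; adjust if your version differs):
# Request synthesized speech and play it back
curl -G --data-urlencode "text=Hello from my local TTS server." http://localhost:5009/api/tts -o hello.wav
aplay hello.wav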
Next, let’s create a unit file for systemd for our TTS server:
sudo nano /etc/systemd/system/tts-server.service
Copy in the following (change <username> to your username):
[Unit]
Description=Coqui TTS running as a server
[Service]
ExecStart=/home/<username>/Projects/tts/tts-venv/bin/tts-server \
--use_cuda true \
--port 5009 \
--model_name tts_models/en/ljspeech/tacotron2-DDC
[Install]
WantedBy=multi-user.target
Save and exit. Load and enable the service. Then, run it:
sudo systemctl daemon-reload
sudo systemctl enable tts-server.service
sudo systemctl start tts-server.service
Check the status with:
systemctl status tts-server.service
You can stop and disable the service with:
sudo systemctl stop tts-server.service
sudo systemctl disable tts-server.service
Watch the output with:
journalctl --follow -u tts-server.service
Your TTS server should now be running on your computer!
Configure Mycroft
Make sure that your STT and TTS servers are running on your computer. Also, make sure that your Mycroft device (e.g. your Raspberry Pi) is connected to the same network as the computer running your STT and TTS services.
Log into Mycroft and exit out of the CLI client (Ctrl+C). Enter the following:
mycroft-config edit user
Add the “stt” and “tts” sections to have Mycroft point to our new STT and TTS servers. Here is an example of a full config file. Note that I have “hey jorvon” as a custom trained wake word model from a previous tutorial, so your “wake_word” and “hotword” section might be different.
{ "max_allowed_core_version": 21.2, "listener": {
"wake_word": "hey jorvon"
},
"hotwords": {
"hey jorvon": {
"module": "precise",
"local_model_file": "/home/pi/.local/share/mycroft/precise/hey-jorvon.pb",
“threshold
"sensitivity": 0.5,
"trigger_level": 3
}
},
"stt": {
"deepspeech_server": {
"uri": "http://192.168.1.209:5008/stt"
},
"module": "deepspeech_server"
},
"tts": {
"mozilla": {
"url": "http://192.168.1.209:5009"
},
"module": "mozilla"
}
}
Save and exit. Restart the Mycroft services:
mycroft-start all restart
Start the Mycroft CLI client:
mycroft-cli-client
With any luck, Mycroft should be using your new TTS and STT servers!
If you need debugging output, Mycroft logs can be found in /var/log/mycroft/.
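For example, to watch the voice service (wake word and STT activity) in real time (the voice.log file name matches a stock Picroft install; your log files may differ):
tail -f /var/log/mycroft/voice.log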
Going Further
See the Mycroft documentation (https://mycroft-ai.gitbook.io/docs/) if you need additional information about STT, TTS, or Mycroft.
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.