
Building IBM Watson Voice Engine with Docker & Ubuntu 18.04

Upgrade Repos

sudo apt-get update && sudo apt-get upgrade

Install Docker-CE

Latest info available from Docker directly.

sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] \
$(lsb_release -cs) \
stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli

Install Docker Machine

Latest info from Docker directly.

base= &&
curl -L $base/docker-machine-$(uname -s)-$(uname -m) >/tmp/docker-machine &&
sudo install /tmp/docker-machine /usr/local/bin/docker-machine

Set System Variables

Install VirtualBox

sudo apt-get install virtualbox
docker-machine create default

If you’re running this inside a VMware guest (I’m using Workstation), you’ll need to enable the “Virtualize Intel VT-x/EPT or AMD-V/RVI” option in the VM’s processor settings.

Otherwise docker-machine create fails with the error “This computer doesn’t have VT-X/AMD-v enabled. Enabling it in the BIOS is mandatory”. Once the option is enabled, create the machine again:
docker-machine create default
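To check up front whether the guest actually sees the virtualisation extensions, you can look for the vmx (Intel VT-x) or svm (AMD-V) CPU flags. This is only a rough sketch, assuming a Linux guest with /proc/cpuinfo; has_virt_flags is a hypothetical helper name, not part of docker-machine:

```shell
# Hypothetical helper: succeeds when the CPU flags include vmx (Intel VT-x)
# or svm (AMD-V). The optional argument points it at another cpuinfo file.
has_virt_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -Eq '^flags.*[[:space:]](vmx|svm)([[:space:]]|$)' "$cpuinfo"
}
```

Something like `has_virt_flags || echo "enable VT-x/AMD-V for this VM first"` before creating the machine should save a failed run.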

Now that you’ve created a default docker-machine, you can finally set the environment variables that point your shell at its Docker engine:

eval "$(docker-machine env default)"
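docker-machine env prints a handful of export lines (DOCKER_HOST, DOCKER_CERT_PATH, DOCKER_MACHINE_NAME and so on) and eval applies them to the current shell. A quick sanity check that they landed might look like the sketch below; docker_env_ready is a hypothetical helper name of mine:

```shell
# Hypothetical helper: reports which of the variables exported by
# `docker-machine env` are missing from the current shell.
docker_env_ready() {
    missing=0
    for name in DOCKER_HOST DOCKER_CERT_PATH DOCKER_MACHINE_NAME; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run `docker_env_ready` after the eval. If it prints nothing and returns 0, the shell is pointed at the machine.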

Clone Git Repo

git clone

Download Docker Images

docker pull ibmcom/voice-gateway-so:latest && docker pull ibmcom/voice-gateway-mr:latest

This step takes a while as the downloads aren’t super quick, but once it’s finished you’re set up and all you need to do is add your credentials.

IBM Watson – Speech to Text (STT)

I’ve been using IBM Watson’s Speech to Text engine for transcribing call audio. Some possible use cases are speech-driven IVRs, voicemail-to-email transcription, and making call recordings text-searchable.

The last time I played with speech recognition on voice platforms was in 2012, and it’s amazing to see how far the technology has come with the help of AI.

IBM’s offering is a bit more flexible than Google’s, and allows long transcriptions (over 1 minute) without uploading the files to external storage.

Sadly, Watson doesn’t have Australian English language models out of the box (+1 point to Google, which does), but you can add Custom Language Models and train them.

Input formats support PCM coded data, so you can pipe PCMA/PCMU (aka G.711 a-law & µ-law) audio straight to it.

Getting Set Up

The first thing you’re going to need is credentials.

For this you’ll need to sign into

Select “Speech to Text” and you can view or copy your API key under the Credentials header.

Once you’ve grabbed your API key we can start transcribing.

Basic Transcription

I’ve got an Asterisk instance that manages voicemail, so let’s fire the deposited messages at Watson and get it to transcribe them:

curl -X POST -u "apikey:yourapikey" --header "Content-Type: audio/wav" --data-binary @msg0059.wav ""
"confidence": 0.831,
"transcript": "hi Nick this is Nick leaving Nick a test voice mail "
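If you’re doing this for more than one message, the curl call is easy to wrap in a function. This is a sketch rather than anything official: transcribe, WATSON_APIKEY, WATSON_URL and the CURL override are all names I’ve made up, and WATSON_URL is assumed to hold the full recognize endpoint for your service instance.

```shell
# Hypothetical wrapper around the curl call above.
# WATSON_APIKEY and WATSON_URL are assumed variable names; WATSON_URL should
# hold the full recognize endpoint for your own service instance.
# CURL can be overridden with another command for a dry run.
transcribe() {
    wav_file="$1"
    "${CURL:-curl}" -s -X POST \
        -u "apikey:${WATSON_APIKEY}" \
        --header "Content-Type: audio/wav" \
        --data-binary "@${wav_file}" \
        "${WATSON_URL}"
}
```

With the two variables exported, `transcribe msg0059.wav` sends the same request as the one-liner.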

Common Transcription Options


Speaker labels (speaker_labels) let you identify each speaker in a multi-party call.

This makes the transcription read more like a script, with “Speaker 1: Hello other person” and “Speaker 2: Hello there Speaker 1”, which makes skimming through much easier.
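Options like this are passed as query parameters on the recognize URL, for example speaker_labels=true. A small sketch for tacking them onto whatever URL you’re using; add_option is a hypothetical helper name and the URL in the usage example is a placeholder:

```shell
# Hypothetical helper: append key=value to a URL as a query parameter,
# using "?" for the first parameter and "&" for any after that.
add_option() {
    case "$1" in
        *\?*) printf '%s&%s=%s\n' "$1" "$2" "$3" ;;
        *)    printf '%s?%s=%s\n' "$1" "$2" "$3" ;;
    esac
}
```

For example, `add_option "http://example.invalid/recognize" speaker_labels true` prints the URL with `?speaker_labels=true` on the end, and you can feed the result back in to add a second option.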


Timestamps mark each word with its offset from the start of the audio file.

This reads poorly in raw curl output, but when used with speaker_labels it lets you see when each thing was said and correlate it with a recording.

One handy use case is searching through a call recording transcript and then jumping to that timestamp in the audio.

For example, in a long conference call recording you might be interested in when people talked about “Item X”. You can search the transcript for “Item” “X”, find it’s at 1:23:45, and then jump to that point in the audio file, saving yourself an hour and a bit of listening to a conference call recording.
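The timestamps come back as offsets in seconds, so to get something you can punch into an audio player you need to convert them. A minimal sketch, with to_hms being a hypothetical helper name:

```shell
# Convert an offset in seconds to H:MM:SS. Fractions of a second are dropped.
to_hms() {
    secs="${1%.*}"      # strip anything after the decimal point
    secs="${secs:-0}"   # treat an empty result (e.g. ".5") as zero
    printf '%d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}
```

So `to_hms 5025.3` gives 1:23:45, the point in the conference call from the example above.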

Audio formats (content types)

Unfortunately Watson, like GCP, only has support for MULAW (μ-law companding) and not PCMA (A-law) as used outside the US.

Luckily it has wide-ranging WAV support, something GCP doesn’t, as well as FLAC, G.729, MPEG, MP3, WebM and Ogg.
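The format is chosen with the Content-Type header on the request. A rough sketch that picks the header value from a file extension; content_type_for is a hypothetical helper name, and the .ulaw mapping assumes raw 8 kHz telephony audio, since Watson wants the sample rate spelled out for headerless μ-law:

```shell
# Hypothetical helper: map a file extension to a Content-Type value.
# The .ulaw entry assumes raw 8 kHz mu-law audio, as found on most phone systems.
content_type_for() {
    case "$1" in
        *.wav)  echo "audio/wav" ;;
        *.flac) echo "audio/flac" ;;
        *.mp3)  echo "audio/mp3" ;;
        *.ogg)  echo "audio/ogg" ;;
        *.webm) echo "audio/webm" ;;
        *.ulaw) echo "audio/mulaw;rate=8000" ;;
        *)      echo "unsupported extension: $1" >&2; return 1 ;;
    esac
}
```

You’d then pass the result to curl as `--header "Content-Type: $(content_type_for msg0059.wav)"`.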

Speech Recognition Model

Watson supports US and GB variants of English speech recognition, for wideband, narrowband and adaptive-rate audio.
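The model is picked with the model query parameter, and the names I’ve seen follow a language-plus-bandwidth pattern such as en-GB_NarrowbandModel, where Watson calls the wideband ones “Broadband”. As a sketch, assuming narrowband for anything sampled below 16 kHz (so 8 kHz phone audio) and broadband otherwise; model_for_rate is a hypothetical helper name:

```shell
# Hypothetical helper: build a model name from a language tag and a sample
# rate in Hz. Assumes narrowband below 16000 Hz and broadband otherwise.
model_for_rate() {
    lang="$1"
    rate="$2"
    if [ "$rate" -lt 16000 ]; then
        echo "${lang}_NarrowbandModel"
    else
        echo "${lang}_BroadbandModel"
    fi
}
```

For voicemail off a phone system, `model_for_rate en-GB 8000` gives en-GB_NarrowbandModel.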


Per-word confidence gives you a confidence score for each word, so you can mark the words Watson isn’t sure it has transcribed correctly with question marks or similar in the final output.

Voice and mail Watson wasn’t sure of
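Marking the doubtful words is simple once you have the word and confidence pairs out of the JSON. The sketch below assumes you’ve already flattened them to one “word confidence” pair per line (that input format and the mark_uncertain name are my own assumptions, not something Watson produces directly):

```shell
# Hypothetical helper: read "word confidence" pairs on stdin, one per line,
# and print the sentence with "?" after any word scoring below the threshold.
mark_uncertain() {
    awk -v min="${1:-0.5}" '
        { printf "%s%s%s", (NR > 1 ? " " : ""), $1, ($2 + 0 < min + 0 ? "?" : "") }
        END { if (NR > 0) printf "\n" }
    '
}
```

Piping `printf 'this 0.98\nis 0.95\nneck 0.41\n'` into `mark_uncertain 0.6` prints “this is neck?”.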


This lets you specify, either per word or for the transcript as a whole, the maximum number of alternatives Watson returns.

This is Neck a test voicemail