First contact with Deep Speech

Hey guys,

I recently started my journey in the domain of ASR. My plan was to build an assistant to better understand the technologies used. With this in mind, I settled on the repository from @dan.bmh, which already streamlines a lot of the preprocessing work needed. While working through the steps, some questions came to mind, which @dan.bmh answered for me. Maybe those can be helpful for some of you, and as I continue my journey I would like to expand on them or provide more information.

Note that the scorer isn’t used for training. Training is just for the acoustic model. The scorer is used for the test validation set after training finishes though.

@dabinat I had a question based on the following:

Looking at

It appears that this takes Rasa intents and converts them into a text file (sentences.txt) such as:

open the door
close the door
answer phone
make coffee

Running the following scripts:

python --input_txt sentences.txt --output_dir . --top_k 500000 --kenlm_bins ../../../kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback
python --alphabet ../alphabet.txt --lm lm.binary --vocab sentences.txt --package commands.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

I believe this generates a trie based on words in the sentence and not the specific sentences, so for example based on the above sentences, the phrase “make the phone” could be returned? Is there a way to guide the scorer towards one of the phrases listed?

According to - transfer learning for the acoustic model should be used to retrain the alphabet using utf-8, which allows a language model that includes spaces? There is specifically a reference to replacing spaces with “|” on that page.

Could you clarify what led you to that understanding from reading the page? I don’t want it to mislead people and I certainly did not intend to give the impression that you should use UTF-8 mode if you want to transfer to a different alphabet… and it certainly is not related to using a language model that includes spaces…

Sure, I’ll try and explain my thought process as I was reading the page. The context in which I was reading it is the use case above: matching specific sequences of words (commands).

Section: Default mode (alphabet based)
Word based means the text corpus used to build the scorer should contain words separated by whitespace

Here I was thinking that in my alphabet, whitespace (specifically spaces) is relevant, so I don’t want to use it as a separator.

Section: UTF-8 mode
UTF-8 scorers are character based (more specifically, Unicode codepoint based), but the way they are used is similar to a word based scorer where each “word” is a sequence of UTF-8 bytes representing a single Unicode codepoint

Specifically, I am trying to match words. So my thinking is that my alphabet is the words I am trying to match on in the specific phrases, e.g.

  • open
  • the
  • door
  • close
  • answer
  • phone
  • coffee

From the language modeling perspective, this is a character based model. From the implementation perspective, this is a word based model, because each character is composed of multiple labels.

Again, this reinforced my belief that this was the correct approach.

Because KenLM uses spaces as a word separator, the resulting language model will not include space characters in it. If you wish to use UTF-8 mode but still model spaces, you need to replace spaces in the input corpus with a different character before converting it to space separated codepoints…

Hope this helps. Given that I have seen command matching in this forum a few times but couldn’t find an answer for this, could you please clarify the following?

Is it possible to build a scorer that would accept the following phrases:

  • open the door
  • close the door
  • answer phone
  • make coffee

But that would not match the phrase:

  • answer the coffee

Ah, that makes perfect sense. Thanks for explaining so clearly! I need to expand on the decoder and scorer documents to clarify what character-based and word-based mean and the fact that these characteristics apply to the language model only.

To do what you want would require some significant changes to the code. UTF-8 mode is actually going in the opposite direction that you want: instead of predicting longer sequences (words) directly, it predicts shorter ones (bytes instead of codepoints). The default mode also assumes each character in the alphabet is a single Unicode codepoint long, so putting entire words in your alphabet file won’t work.

Another problem with simply putting entire words in your alphabet file is that the acoustic model is making predictions over overlapping windows of the input. By default, these windows are 32ms wide and have a step size of 20ms. This means your model would be making independent predictions of an entire word at every 20ms of input audio, which doesn’t make a lot of sense.
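To put rough numbers on the windowing point above, here is a minimal sketch of the arithmetic. It only assumes the 32ms window and 20ms step mentioned in this thread; the 500ms word duration and the function name `count_windows` are made up for illustration, and real feature extraction may pad or trim the audio differently.

```python
def count_windows(audio_ms: int, window_ms: int = 32, step_ms: int = 20) -> int:
    """Count the full windows that fit into audio_ms milliseconds of audio.

    Simplified arithmetic only: no padding, no partial windows.
    """
    if audio_ms < window_ms:
        return 0
    return (audio_ms - window_ms) // step_ms + 1


# Hypothetical duration of one spoken word (an assumption, not a measurement).
word_ms = 500

# Consecutive windows overlap by window - step = 12 ms.
overlap_ms = 32 - 20

print(count_windows(word_ms))  # 24
print(overlap_ms)              # 12
```

So with whole words as labels, the model would have to emit a word-level prediction roughly two dozen times over a single half-second word, each time seeing only a 32ms slice of it.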

I’d encourage you to start with a less invasive approach. Use an English alphabet (if your data is in English), and train a word-based language model with your target phrases. See how that performs, and go from there.

The scorer does not accept phrases, it encodes probabilities for n-grams. If your input corpus does not have the answer the coffee n-gram, then the scorer will give it a very low probability.
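As a toy illustration of the n-gram point (this is not KenLM and not the DeepSpeech scorer, just plain bigram counting over the example corpus from earlier in the thread), the sketch below lists which bigrams of a candidate phrase never occur in the corpus. The helper names are made up. A real smoothed language model backs off to lower-order n-grams, so unseen sequences get a low probability rather than being rejected outright.

```python
from collections import Counter

# Example corpus taken from the question above.
CORPUS = ["open the door", "close the door", "answer phone", "make coffee"]


def bigrams(sentence):
    """Return the bigrams of a sentence, including start/end markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


# Count every bigram that appears in the corpus.
SEEN = Counter(bg for sentence in CORPUS for bg in bigrams(sentence))


def unseen_bigrams(sentence):
    """Bigrams of the sentence that never occur in the corpus."""
    return [bg for bg in bigrams(sentence) if bg not in SEEN]


print(unseen_bigrams("open the door"))      # []
print(unseen_bigrams("answer the coffee"))  # [('answer', 'the'), ('the', 'coffee')]
print(unseen_bigrams("make the phone"))     # [('make', 'the'), ('the', 'phone')]
```

Every word of "answer the coffee" is in the vocabulary, but two of its bigrams are not in the corpus, which is why such a phrase ends up unlikely rather than impossible.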

Thanks for clarifying, your explanation definitely makes sense.

I had actually already tried a word-based language model using the following corpus and the generate_package scripts:

  • add task
  • away
  • change channel
  • delete task
  • finish task
  • complete task
  • list commands
  • list tasks

I built a small web application that picks a random command from a list of commands and shows a recording prompt. The user’s audio is then fed to the server and passed to the model, and the result is compared with the random command. (Happy to provide the source code if this is something that would be useful for DeepSpeech-Examples.)

In my testing, this approach works very well for some phrases, and poorly for others.

Total success rate: 59%

Add Task - 83%
Away - 100%
Change Channel - 17%
Delete Task - 86%
Finish Task - 0%
Complete Task - 17%
List Commands - 100%
List Tasks - 67%

Some notes about the application:

  1. The random phrase uses a bucket distribution, so each command has an even representation.
  2. American accents matched a little better, 68% was the highest overall accuracy.
  3. Only exact matches are counted. So for example “list task” wouldn’t count as a match for “list tasks”. Given that the scorer works on n-grams, calculating the Jaro distance and checking it against a threshold would definitely improve the accuracy.
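The threshold idea from the last note could look roughly like the sketch below. It is a plain-Python implementation of the standard Jaro similarity, not code from the application described here; the 0.9 threshold and the `best_match` helper are assumptions chosen for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


# Command list from the corpus above.
COMMANDS = ["add task", "away", "change channel", "delete task",
            "finish task", "complete task", "list commands", "list tasks"]


def best_match(transcript: str, threshold: float = 0.9):
    """Return the closest command, or None if nothing reaches the threshold.

    The 0.9 default is an arbitrary assumption and would need tuning.
    """
    score, command = max((jaro(transcript.lower(), c), c) for c in COMMANDS)
    return command if score >= threshold else None


print(best_match("list task"))  # list tasks
```

With short commands that differ by one word ("add task" vs. "delete task"), a threshold that is too low will start confusing commands with each other, so it is worth checking the pairwise similarity of the commands themselves before picking a value.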

So it is definitely off to a good start, and I am looking for ways to improve from here.

Just an update on this: after some experimenting, it appears that some of the time the streaming recording from the browser was being garbled, which was causing the high error rate. After listening to some of the recordings, I am surprised it was ever able to match them!

I changed to uploading the audio after recording had finished, and the success rate climbed to over 90%.

@gazler this is an interesting approach. And you’re right in your assumptions about Jaco’s scorer generation.

Regarding your >90% accuracy: in a coffee ordering benchmark (which has about 100 distinct words) I could reach 0.913 without mixing noise into the dataset. So I’m not sure this approach improves accuracy much.

How do you think your sentence scorer scales with bigger datasets with many sentences?

You can find my benchmark code here: If you’re interested, you could run a new benchmark for comparison and create a merge request with your result. I think this wouldn’t be much work, mostly exchanging the scorer files. Using only a subset of the voice commands (100/620), the benchmark runtime should be about 1-3h.

Can you make sure you are sending proper audio from the browser? By default, the model expects 16kHz mono little-endian WAVE audio.
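A quick way to check this on the server side is to inspect the header of the uploaded file with Python’s standard `wave` module. The sketch below is just that, a sketch: the function name `check_wav` is made up, and the 16-bit sample width is an assumption on top of what is stated above. The `wave` module reads RIFF (little-endian) PCM files only, so a file it can open at all is already little-endian.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1,
              expected_sample_width=2):
    """Return a list of header mismatches; an empty list means all good.

    expected_sample_width=2 (16-bit samples) is an assumption, not
    something stated in this thread.
    """
    problems = []
    # wave.open raises wave.Error for files that are not RIFF/PCM WAVE.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate}")
        if wav.getnchannels() != expected_channels:
            problems.append(
                f"{wav.getnchannels()} channel(s), expected {expected_channels}")
        if wav.getsampwidth() != expected_sample_width:
            problems.append(
                f"sample width is {wav.getsampwidth()} byte(s), "
                f"expected {expected_sample_width}")
    return problems
```

Calling `check_wav("recording.wav")` (a hypothetical file name) on a browser upload returns something like a sample-rate mismatch if the browser recorded at 44.1kHz or 48kHz. A correct header does not rule out garbled audio, though, so listening to a few recordings is still worthwhile.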