How is the language model used in DeepSpeech?

(jiping_s) #1

In traditional speech recognizers, the language model specifies which word sequences are possible. DeepSpeech seems to generate its final output based on statistics at the letter level, not the word level.

I have a language model containing a few hundred words, in ARPA format:
ngram 1=655
ngram 2=3133
ngram 3=4482

-3.4836104 <unk> 0
0 <s> -0.8111794
-2.849284 hi -0.19547889
-2.3366103 good -0.3116794
-3.3399653 afternoon -0.13693075
-2.0126188 two -0.2772886
-2.5213206 plain -0.19897509
-2.185633 bagel -0.4087109
-2.5213206 toasted -0.40743655

According to this model, the word sequence “would you like to try our strudel for twenty five cents” is possible. However, the final output is not what I would expect if the language model were used in the traditional way.

Here is the detailed process:
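As an illustration of what the ARPA excerpt above encodes, here is a minimal Python sketch that reads unigram lines (log10 probability, word, optional backoff weight) and checks whether every word of a sentence is in the vocabulary. The helper names `parse_unigrams` and `out_of_vocabulary` are hypothetical, and the sample lines are copied from the excerpt; this is not how DeepSpeech or KenLM read the file.

```python
def parse_unigrams(lines):
    """Parse ARPA unigram lines of the form: log10_prob word [backoff]."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and section headers
        log_prob = float(fields[0])
        word = fields[1]
        # The backoff weight is optional in ARPA files; default to 0.0.
        backoff = float(fields[2]) if len(fields) > 2 else 0.0
        table[word] = (log_prob, backoff)
    return table


def out_of_vocabulary(sentence, table):
    """Return the words of the sentence that have no unigram entry."""
    return [word for word in sentence.split() if word not in table]


# Three unigram lines taken from the excerpt above.
sample = [
    "-2.849284 hi -0.19547889",
    "-2.3366103 good -0.3116794",
    "-3.3399653 afternoon -0.13693075",
]
unigrams = parse_unigrams(sample)
missing = out_of_vocabulary("good afternoon strudel", unigrams)
print(missing)
```

With only these three sample lines loaded, "strudel" is reported as missing; against the full 655-word model it would presumably be found.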

(1) Building language model -

./lmplz --text corpus.txt --arpa --o 3
./build_binary -T -s lm.binary

(2) Building trie -

./generate_trie models/alphabet.txt lm.binary corpus.txt trie

(In the trie-building step, alphabet.txt is the original file from the DeepSpeech release, lm.binary and corpus.txt are my own files from step (1), and trie is the newly generated file.)

(3) Running deepspeech (the wave file says “would you like to try our strudel for twenty five cents?”) -

(3.1) First, use my language model with DeepSpeech’s original acoustic model (the .pb file) -

deepspeech models/output_graph.pb test13.wav models/alphabet.txt ./lm.binary ./trie

Output:

Loading model from file models/output_graph.pb
Loaded model in 0.204s.
Loading language model from files ./lm.binary ./trie
Loaded language model in 0.004s.
Running inference.
would you like to trialastruodle for twenty five cents
Inference took 5.162s for 4.057s audio file.

(3.2) Then, use everything from DeepSpeech, including its own language model and trie -

deepspeech models/output_graph.pb test13.wav models/alphabet.txt models/lm.binary models/trie


Loading model from file models/output_graph.pb
Loaded model in 0.223s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 1.092s.
Running inference.
would i like to trialastruodlefortwentyfvecents
Inference took 5.141s for 4.057s audio file.

Comparing the output of both runs:

would you like to trialastruodle for twenty five cents
would i like to trialastruodlefortwentyfvecents

DeepSpeech seems to use the language model differently from the traditional way: the letter sequence "trialastruodle" only roughly resembles the word sequence “try our strudel”, which is what the language model contains. The language model is clearly applied as a second processing layer after the neural network generates letter sequences, since the two results above differ only because different language models were used. My question is: why are the strange letter sequences still there?

(kdavis) #2

There’s an explanation of how the language model is integrated into Deep Speech in our blog post “A Journey to <10% Word Error Rate”.

If you have any questions after reading our blog post, feel free to ask them here.

Thanks for taking the time to dig into our code!
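For readers who want the gist before following the link: the original Deep Speech paper (Hannun et al., 2014) describes a beam search that ranks candidate character sequences $c$ for audio $x$ by a combined objective. Whether the release tested in this thread uses exactly this form, and with which weights, is an assumption here; the blog post is the authoritative description.

```latex
Q(c) = \log P(c \mid x) + \alpha \, \log P_{\mathrm{lm}}(c) + \beta \, \mathrm{word\_count}(c)
```

Here $P(c \mid x)$ is the acoustic network's probability, $P_{\mathrm{lm}}(c)$ is the n-gram language model probability, and $\alpha$, $\beta$ are tunable weights. Under such an objective the language model acts as a soft score rather than a hard constraint, so a letter sequence the language model dislikes can still win if the acoustic term favours it strongly enough.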

(jiping_s) #3

Thanks for the reply and the info links. I will read the relevant papers in detail.

Coming back to my tests: the first result contains a letter sequence that is not legal under the language model:

" trialastruodle"
where the expected words are
"try our strudel"

I assume this indicates that the decoder had too little confidence in that part of the audio to produce the correct words. To me this is both bad and good. It is bad because the decoder should return only words that are legal under the language model. It is good because it signals lower decoding confidence for that part of the audio, which is a useful piece of information. I can use a post-processor that checks for such illegal letter sequences. If it finds any, it can apply sequence similarity to transform them into word sequences that are legal under the language model, while assigning them a lower confidence for the next processing stages: NLU, dialogue management, and so on.
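A rough sketch of such a post-processor, using Python's standard `difflib`, might look like the following. The function names, the tiny vocabulary, and the list of candidate phrases are all made up for illustration; a real version would draw candidate phrases from the n-grams in the language model, and the similarity ratio is only a crude stand-in for a confidence score.

```python
from difflib import SequenceMatcher


def best_phrase(token, phrases):
    """Return (phrase, similarity) for the known phrase closest to token."""
    def score(phrase):
        # Compare against the phrase with spaces removed, because the
        # decoder ran the words together into a single token.
        return SequenceMatcher(None, token, phrase.replace(" ", "")).ratio()

    best = max(phrases, key=score)
    return best, score(best)


def repair(transcript, vocabulary, phrases):
    """Replace out-of-vocabulary tokens and report each replacement."""
    words, repaired = [], []
    for token in transcript.split():
        if token in vocabulary:
            words.append(token)
        else:
            phrase, similarity = best_phrase(token, phrases)
            words.append(phrase)
            # Keep the similarity so later stages can lower their confidence.
            repaired.append((token, phrase, similarity))
    return " ".join(words), repaired


# Hypothetical vocabulary and candidate phrases for this one example.
vocabulary = {"would", "you", "like", "to", "try", "our", "strudel",
              "for", "twenty", "five", "cents"}
phrases = ["try our strudel", "good afternoon", "two plain bagel"]

fixed, repaired = repair(
    "would you like to trialastruodle for twenty five cents",
    vocabulary,
    phrases,
)
print(fixed)
for token, phrase, similarity in repaired:
    print(f"{token!r} -> {phrase!r} (similarity {similarity:.2f})")
```

Picking the single best phrase with no threshold keeps the sketch short; in practice a minimum similarity would be needed so that unrelated garbage is flagged rather than forced onto the nearest phrase.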

(Buvana R) #4

@kdavis, thanks a lot for the blog post link. Very informative!!

What were the corpus and vocabulary that went into building your lm.binary and trie that are released at:

Did you use the same vocab.txt that exists under DeepSpeech/data/lm? I see that this vocab.txt is pretty much based on TEDLIUM transcripts. Can you confirm?

(kdavis) #5

The corpus was a combination of the LibriVox, Fisher, and Switchboard training sets, along with some other data.

@reuben built the trie, so I’ll let him describe that.

(Reuben Morais) #6

The trie file was built from data/lm/vocab.txt