In traditional speech recognizers language model specifies what word sequence is possible. Deepspeech seems to generate final output based on statistics at letter level (not word level).
I have a language model containing a few hundred words, in arpa:
\data
ngram 1=655
ngram 2=3133
ngram 3=4482
\1-grams:
-3.4836104 0
0 -0.8111794 0
-0.93793285
-2.849284 hi -0.19547889
-2.3366103 good -0.3116794
-3.3399653 afternoon -0.13693075
-2.0126188 two -0.2772886
-2.5213206 plain -0.19897509
-2.185633 bagel -0.4087109
-2.5213206 toasted -0.40743655
…
by which the word sequence “would you like to try our strudel for twenty five cents” is possible. However the final output is not what I expected if language model is used in traditional way.
Here is detailed process:
(1) Building language model -
./lmplz --text corpus.txt --arpa corpus.arpa --o 3
./build_binary -T -s corpus.arpa lm.binary
(2) Building trie -
./generate_trie models/alphabet.txt lm.binary corpus.txt trie
(in the building trie step, alphabet.txt is the original file from Deepspeech release, lm.binary and corpus.txt are my own files from step (1), and trie is the generated new file)
(3) run deepspeech (wave file says “would you like to try our strudel for twenty five cents?”) -
(3.1) First, use my language model with Deepspeech’s original acoustic model (the .pb file) -
deepspeech models/output_graph.pb test13.wav models/alphabet.txt ./lm.binary ./trie
output :
Loading model from file models/output_graph.pb
Loaded model in 0.204s.
Loading language model from files ./lm.binary ./trie
Loaded language model in 0.004s.
Running inference.
would you like to trialastruodle for twenty five cents
Inference took 5.162s for 4.057s audio file.
(3.2) Then use everything of Deepspeech
deepspeech models/output_graph.pb test13.wav models/alphabet.txt models/lm.binary models/trie
output:
Loading model from file models/output_graph.pb
Loaded model in 0.223s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 1.092s.
Running inference.
would i like to trialastruodlefortwentyfvecents
Inference took 5.141s for 4.057s audio file.
(deepspeech-venv)jeremy@levono:~/DeepSpeech$
Now from the output of both runs:
would you like to trialastruodle for twenty five cents
would i like to trialastruodlefortwentyfvecents
Deepspeech seems to use the language model in a way different from the traditional way: the letter sequence such as " trialastruodle" has only rough similarity to what should be the word sequence “try our strudel” which is what the language model contains. It seems that after the neural network generates letter sequences, language model definitely is used to do a second layer processing, so that we can see the results above are different due to the use of different language models. My question is why the strange letter sequence are still there?