I have a very specific use-case vocabulary with only 73 distinct English words. I generated a text file containing all possible legal combinations of those words; it has around 2×10^5 lines and is 4.4 MB in size. I generated the scorer package using these files (using the instructions here).
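For reference, the input file for the LM build is plain text with one utterance per line, along these lines (illustrative lines only):

queen a takes b four
king goes out
queen king night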
I thought this would be enough, since the acoustic model remains the same (English). I used this scorer combined with the pre-trained v0.7.0 model.pbmm file to run the vad_transcriber example.
However, the results were not good! For example, I said “queen a takes b four” but the output was “horse b to”. I changed the value of --aggressive from 0 to 3 without success. When recorded without background noise (a ceiling fan), it generated “rex b four”.
I am recording on a 22 kHz headset microphone and downsampling to 16 kHz using sox. I say one word per second, and the words are clear to me when I listen to the downsampled wav file myself.
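For reference, the sox downsampling step is something like this (file names are placeholders; DeepSpeech expects 16 kHz, 16-bit, mono wav):

$ sox input_22k.wav -r 16000 -c 1 -b 16 output_16k.wav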
I have also tried the mic vad streaming example, and it does not produce good transcription either.
Is there anything else that needs to be done?
PS: transcription is worse when using the pretrained v0.7.0 scorer (it generates some non-chess gibberish, which is kind of expected since it is a general English language scorer).
lissyx:
Please first reproduce with deepspeech binaries, not third-party examples.
Make everything reproducible. So, record with a mic and feed the recording. That way, you can try different stuff on the same chunk.
Given you are using the latest released model, what is the output without the scorer argument? That way you can see the pure output of the neural net. If it doesn’t go in roughly the right direction, a good LM won’t save you.
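For example (model and audio file names assumed):

$ deepspeech --model deepspeech-0.7.0-models.pbmm --audio recording.wav

and then the same file with your scorer:

$ deepspeech --model deepspeech-0.7.0-models.pbmm --scorer kenlm.scorer --audio recording.wav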
What arguments did you use to produce the LM, standard or modified?
Thanks, I am already using wav files as far as possible.
I ran the mic vad streaming example without the scorer argument (python mic_vad_streaming.py -v 0 -m ~/DeepSpeech/models/models.pbmm), and it seems to produce at least acoustically matching results. For example:
(what I said) => (transcription)
king goes out => ging gos out
queen a takes b four => queen ag igs b ford
queen king night => gueen king niht
the quick brown fox jumped over the lazy dog => the quic trowne fox jumped over the lazy dogk
but as you can see, the vocabulary is all over the place.
I used the following commands:
$ python generate_lm.py --input_txt ~/pat/to/in.transcript --output_dir . --kenlm_bins ~/path/to/kenlm/build/bin --arpa_order 4 --max_arpa_memory "90%" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --top_k 50000 --arpa_prune "0" --discount_fallback
Converting to lowercase and counting word occurrences ...
| | # | 200451 Elapsed Time: 0:00:01
Saving top 50000 words ...
Calculating word statistics ...
Your text file has 919303 words in total
It has 73 unique words
Your top-50000 words are 100.0000 percent of all words
Your most common word "takes" occurred 66816 times
The least common word in your top-k is "abort" with 1 times
The first word with 2 occurrences is "game" at place 69
Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /home/gt/otherrepos/DeepSpeech/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 919303 types 76
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:912 2:1264540416 3:2371013376 4:3793621504
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 3: D1=0.5 D2=1 D3+=1.5
Statistics:
1 76 D1=0.5 D2=1 D3+=1.5
2 1263 D1=0.5 D2=1 D3+=1.5
3 19067 D1=0.5 D2=1 D3+=1.5
4 114691 D1=0.5 D2=1 D3+=1.5
Memory estimate for binary LM:
type kB
probing 2494 assuming -p 1.5
probing 2613 assuming -r models -p 1.5
trie 749 without quantization
trie 315 assuming -q 8 -b 8 quantization
trie 731 assuming -a 22 array pointer compression
trie 297 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:7446612 kB VmRSS:9560 kB RSSMax:1455416 kB user:0.372354 sys:0.416396 CPU:0.788805 real:0.817813
Filtering ARPA file using vocabulary of top-k words ...
Reading ./lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Building lm.binary ...
Reading ./lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
and then this:
$ python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-50000.txt --package kenlm.scorer --default_alpha 1.234 --default_beta 1.012
73 unique words read from vocabulary file.
Doesn't look like a character based model.
Using detected UTF-8 mode: False
Package created in kenlm.scorer
swig/python detected a memory leak of type 'Alphabet *', no destructor found.
Do you have an accent? The standard model works well for American English. Try speaking slowly and clearly.
Do the generate_lm.py steps yourself; follow these steps and leave out the pruning and the a and b bits, as they lose information along the way. You’ll need the discount fallback, though, for so few words.
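For example, an unquantized trie build straight from KenLM on the filtered ARPA would look something like this (paths assumed from your run above):

$ ~/path/to/kenlm/build/bin/build_binary trie lm_filtered.arpa lm.binary

Then package the result with generate_package.py as before.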
Ahh, I finally got it to produce good transcriptions. For some reason, I was running DeepSpeech-examples. I guess that’s because they were quick to set up.
However, now I pip-installed the deepspeech module and ran the same files through it, and voilà, I’m getting better transcriptions! Command used: deepspeech --model models/models.pbmm --scorer models/myown.scorer --audio audio.wav.
I am still not sure, though, what the problem was with those DeepSpeech-examples.
I also noticed that passing manually downsampled wav files into the tool resulted in poorer performance compared to passing the original wav files. Moreover, when speaking directly into my headset microphone, I get poor results, but when I speak at a distance from it, I get improved results.
However, the scorer is still not accurate. It:
* sometimes skips over words even though they were clearly spoken,
* still confuses quite a few words because they sound similar, and
* produces some erratic transcriptions (confusing and skipping words at the same time).
I will try adjusting the scorer parameters as you said @othiele.
Also, I get these messages:
...
Warning: original sample rate (22050) is different than 16000hz. Resampling might produce erratic speech recognition.
Running inference.
<<transcription>>
Inference took 6.212s for 48694.288s audio file.
Notice that it gets the duration of the audio file wrong (48694 seconds is way too long) and also shows a warning. Are there any fixes for this specific issue?
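For what it’s worth, a quick way to check the rate before inference (soxi ships with sox; file name is a placeholder):

$ soxi -r audio.wav
22050
$ sox audio.wav -r 16000 audio_16k.wav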
Mic vad streaming and vad transcriber. Found here.
Downsampling with ffmpeg did not improve results. I had 5-6 wav files; after downsampling I get equal or slightly worse transcriptions. Moreover, I cannot use Audacity, as this software will run offline on the client’s machine.
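The ffmpeg conversion was along these lines (file names are placeholders):

$ ffmpeg -i input.wav -ar 16000 -ac 1 output_16k.wav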
Yep, it is the scorer. When running without the scorer, I get longer transcription outputs (i.e., audio does not get skipped).
I set --arpa_prune "0", but what did you mean by “a and b bits”? Are they the -a and -b for build_binary, or the default_alpha and default_beta for generate_package.py? (The latter arguments are compulsory, though.)
I built the scorer without -a and -b for build_binary for now. It caught some more words, but the transcription results are still not that great.
Thinking about the bad scorer: @reuben, given he is building a custom scorer for the 0.7.1 release, what values should he use for default_alpha and default_beta? The ones from the release page, or different ones? And could that explain the strange predictions?
So, what would be the best usage according to you? I read the “DeepSpeech in the wild” topic, where many have been using DeepSpeech in what are, from what I can tell, production environments.
Meaning running on a server where you can check the results from time to time. If you plan on doing that, fine. Then you could use ffmpeg, …, to transform the wavs.
As for the default_alpha/beta: if you have a good test set yourself, you could find optimal values with lm_optimizer, as described in the release notes:
Subsequent to this the `lm_optimizer.py` was used with the following parameters:
* `lm_alpha_max` 5
* `lm_beta_max` 5
* `n_trials` 2400
* `test_files` LibriSpeech clean dev corpus.
to determine the optimal `lm_alpha` and `lm_beta` with respect to the LibriSpeech clean dev corpus. This resulted in:
* `lm_alpha` 0.931289039105002
* `lm_beta` 1.1834137581510284
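Adapted to a custom scorer, the invocation would be something along these lines (the checkpoint directory, scorer path, and test CSV are placeholders; check python lm_optimizer.py --help for the exact flag names in your version):

$ python lm_optimizer.py --test_files chess_test.csv --checkpoint_dir ~/path/to/checkpoints --scorer_path kenlm.scorer --lm_alpha_max 5 --lm_beta_max 5 --n_trials 2400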
I am not too sure whether this will improve recognition a lot; as with other hyperparameters, sometimes you have to play around a bit.
Unfortunately, I do not have a test set. Creating test sets is too much effort, as this is only a side project for me. So I will have to experiment with the values manually. Thanks, though!
Either way, you’ll need some sort of test set, even if it’s only 100 moves. It doesn’t take more than an hour to build, and you would have some idea.
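The test set is just a CSV in DeepSpeech’s training format, one row per clip; a minimal sketch (paths and byte sizes are placeholders):

wav_filename,wav_filesize,transcript
/home/user/clips/move001.wav,123456,queen a takes b four
/home/user/clips/move002.wav,98765,king goes out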