I have a very specific use-case vocabulary with only 73 distinct English words. I generated a text file containing all possible legal combinations of those words; it has around 2×10^5 lines and is 4.4 MB in size. I generated the scorer package using these files (using the instructions here).
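For reference, the input file for the LM build is plain text with one utterance per line, along these lines (illustrative lines only):

queen a takes b four
king goes out
queen king night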
I thought this would be enough, since the acoustic model remains the same (English). I used this scorer combined with the pre-trained v0.7.0 model.pbmm file to run the vad_transcriber example.
However, the results were not good! For example, I said “queen a takes b four” but the output was “horse b to”. I changed the value of --aggressive from 0 to 3 without success. When recorded without background noise (a ceiling fan), it generated “rex b four”.
I am recording on a 22 kHz headset microphone and downsampling to 16 kHz using sox. I say one word per second, and the words are clear to me when I listen to the downsampled wav file myself.
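For reference, the sox downsampling step is something like this (file names are placeholders; DeepSpeech expects 16 kHz, 16-bit, mono wav):

$ sox input_22k.wav -r 16000 -c 1 -b 16 output_16k.wav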
I have also tried the mic vad streaming example, and it does not produce good transcription either.
Is there anything else that needs to be done?
PS: transcription is worse when using the pretrained v0.7.0 scorer (it generates some non-chess gibberish, which is kind of expected since it is a general English language scorer).
lissyx:
Please first reproduce with deepspeech binaries, not third-party examples.
Make everything reproducible. So, record with a mic and feed the recording. That way, you can try different stuff on the same chunk.
Given you are using the latest released model, what is the output without the scorer argument? That way you can see the pure output of the neural net. If it doesn’t go in roughly the right direction, a good LM won’t save you.
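For example (model and audio file names assumed):

$ deepspeech --model deepspeech-0.7.0-models.pbmm --audio recording.wav

and then the same file with your scorer:

$ deepspeech --model deepspeech-0.7.0-models.pbmm --scorer kenlm.scorer --audio recording.wav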
What arguments did you use to produce the LM, standard or modified?
Thanks, I am already using wav files as far as possible.
I ran the mic vad streaming example without the scorer argument (python mic_vad_streaming.py -v 0 -m ~/DeepSpeech/models/models.pbmm), and it seems to produce at least acoustically matching results. For example:
(what I said) => (transcription)
king goes out => ging gos out
queen a takes b four => queen ag igs b ford
queen king night => gueen king niht
the quick brown fox jumped over the lazy dog => the quic trowne fox jumped over the lazy dogk
but as you can see, the vocabulary is all over the place.
I used the following commands:
$ python generate_lm.py --input_txt ~/pat/to/in.transcript --output_dir . --kenlm_bins ~/path/to/kenlm/build/bin --arpa_order 4 --max_arpa_memory "90%" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --top_k 50000 --arpa_prune "0" --discount_fallback
Converting to lowercase and counting word occurrences ...
| | # | 200451 Elapsed Time: 0:00:01
Saving top 50000 words ...
Calculating word statistics ...
Your text file has 919303 words in total
It has 73 unique words
Your top-50000 words are 100.0000 percent of all words
Your most common word "takes" occurred 66816 times
The least common word in your top-k is "abort" with 1 times
The first word with 2 occurrences is "game" at place 69
Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /home/gt/otherrepos/DeepSpeech/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 919303 types 76
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:912 2:1264540416 3:2371013376 4:3793621504
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 3: D1=0.5 D2=1 D3+=1.5
Statistics:
1 76 D1=0.5 D2=1 D3+=1.5
2 1263 D1=0.5 D2=1 D3+=1.5
3 19067 D1=0.5 D2=1 D3+=1.5
4 114691 D1=0.5 D2=1 D3+=1.5
Memory estimate for binary LM:
type kB
probing 2494 assuming -p 1.5
probing 2613 assuming -r models -p 1.5
trie 749 without quantization
trie 315 assuming -q 8 -b 8 quantization
trie 731 assuming -a 22 array pointer compression
trie 297 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:7446612 kB VmRSS:9560 kB RSSMax:1455416 kB user:0.372354 sys:0.416396 CPU:0.788805 real:0.817813
Filtering ARPA file using vocabulary of top-k words ...
Reading ./lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Building lm.binary ...
Reading ./lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
and then this:
$ python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-50000.txt --package kenlm.scorer --default_alpha 1.234 --default_beta 1.012
73 unique words read from vocabulary file.
Doesn't look like a character based model.
Using detected UTF-8 mode: False
Package created in kenlm.scorer
swig/python detected a memory leak of type 'Alphabet *', no destructor found.
Do you have an accent? The standard model works well for American English. Try speaking slowly and clearly.
Do the generate_lm.py steps yourself; follow these steps and leave out the pruning and the a and b bits, as they lose information along the way. You’ll need the discount fallback, though, for so few words.
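For example, an unquantized trie build straight from KenLM on the filtered ARPA would look something like this (paths assumed from your run above):

$ ~/path/to/kenlm/build/bin/build_binary trie lm_filtered.arpa lm.binary

Then package the result with generate_package.py as before.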
Ahh, I finally got it to produce good transcriptions. For some reason, I was running DeepSpeech-examples. I guess that’s because they were quick to set up.
However, now I pip-installed the deepspeech module and ran the same files through it, and voilà, I’m getting better transcriptions! Command used: deepspeech --model models/models.pbmm --scorer models/myown.scorer --audio audio.wav.
I am still not sure, though, what the problem was with those DeepSpeech-examples.
I also noticed that passing manually downsampled wav files into the tool resulted in poorer performance compared to passing the original wav files. Moreover, when speaking directly into my headset microphone, I get poor results, but when I speak at a distance from it, I get improved results.
However, the scorer is still not accurate. It:
* sometimes skips over words even though they were clearly spoken,
* still confuses quite a few words because they sound similar, and
* produces some erratic transcriptions (confusing and skipping words at the same time).
I will try adjusting the scorer parameters as you said @othiele.
Also, I get these messages:
...
Warning: original sample rate (22050) is different than 16000hz. Resampling might produce erratic speech recognition.
Running inference.
<<transcription>>
Inference took 6.212s for 48694.288s audio file.
Notice that it gets the duration of the audio file wrong (48694 seconds is way too long) and also shows a warning. Are there any fixes for this specific issue?
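For what it’s worth, a quick way to check the rate before inference (soxi ships with sox; file name is a placeholder):

$ soxi -r audio.wav
22050
$ sox audio.wav -r 16000 audio_16k.wav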
Mic vad streaming and vad transcriber. Found here.
Downsampling with ffmpeg did not improve results. I had 5-6 wav files; after downsampling I get equal or slightly worse transcriptions. Moreover, I cannot use Audacity, as this software will run offline on the client’s machine.
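The ffmpeg conversion was along these lines (file names are placeholders):

$ ffmpeg -i input.wav -ar 16000 -ac 1 output_16k.wav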
Yep, it is the scorer. When running without the scorer, I get longer transcription outputs (i.e., audio does not get skipped).
I set --arpa_prune "0", but what did you mean by “a and b bits”? Are they the -a and -b for build_binary, or the default_alpha and default_beta for generate_package.py? (The latter arguments are compulsory, though.)
I built the scorer without -a and -b for build_binary for now. It caught some more words, but the transcription results are still not that great.
Thinking about the bad scorer: @reuben, given he is building a custom scorer for the 0.7.1 release, what values should he use for default_alpha and default_beta? The ones from the release page, or different ones? And could that explain the strange predictions?
So, what would be the best usage according to you? I read the “DeepSpeech in the wild” topic, where many have been using DeepSpeech in what are, from what I can tell, production environments.
Meaning running on a server where you can check the results from time to time. If you plan on doing that, fine. Then you could use ffmpeg, …, to transform the wavs.
As for the default_alpha/beta: if you have a good test set yourself, you could find optimal values with lm_optimizer, as described in the release notes:
Subsequent to this the `lm_optimizer.py` was used with the following parameters:
* `lm_alpha_max` 5
* `lm_beta_max` 5
* `n_trials` 2400
* `test_files` LibriSpeech clean dev corpus.
to determine the optimal `lm_alpha` and `lm_beta` with respect to the LibriSpeech clean dev corpus. This resulted in:
* `lm_alpha` 0.931289039105002
* `lm_beta` 1.1834137581510284
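Adapted to a custom scorer, the invocation would be something along these lines (the checkpoint directory, scorer path, and test CSV are placeholders; check python lm_optimizer.py --help for the exact flag names in your version):

$ python lm_optimizer.py --test_files chess_test.csv --checkpoint_dir ~/path/to/checkpoints --scorer_path kenlm.scorer --lm_alpha_max 5 --lm_beta_max 5 --n_trials 2400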
I am not too sure whether this will improve recognition a lot; as with other hyperparameters, sometimes you have to play around a bit.
Unfortunately, I do not have a test set. Creating test sets is too much effort, as this is only a side project for me. So I will have to experiment with the values manually. Thanks, though!
Either way, you’ll need some sort of test set, even if it’s only 100 moves. It doesn’t take more than an hour to build, and you would have some idea.
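The test set is just a CSV in DeepSpeech’s training format, one row per clip; a minimal sketch (paths and byte sizes are placeholders):

wav_filename,wav_filesize,transcript
/home/user/clips/move001.wav,123456,queen a takes b four
/home/user/clips/move002.wav,98765,king goes out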