How to get good transcription results with only a specific English vocabulary?

I have a very specific use-case vocabulary of only 73 distinct English words. I generated a text file containing all possible legal combinations of those words; it has around 2*10^5 lines and is 4.4 MB in size. I generated the scorer package using these files. (Using instructions here)

I thought this would be enough, since the acoustic model remains the same (English). I used this scorer combined with the pre-trained v0.7.0 model.pbmm file to run the vad_transcriber example.

However, the results were not good! For example, I said “queen a takes b four” but the output was “horse b to”. I changed the value of --aggressive from 0 to 3 without success. When I recorded without background noise (a ceiling fan), it generated “rex b four”.

I am recording on a 22 kHz headset microphone and downsampling to 16 kHz using sox. I say one word per second, and the words are clear to me when I listen to the downsampled WAV file myself.
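For reference, the sox invocation I use is roughly this (filenames are placeholders):

    $ # downsample a 22.05 kHz recording to 16 kHz, mono, 16-bit PCM
    $ sox input_22k.wav -r 16000 -c 1 -b 16 output_16k.wav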

I have also tried the mic vad streaming example, and it does not produce good transcription either.

Is there anything else that needs to be done?

PS: transcription is worse when using the pre-trained v0.7.0 scorer (it generates some non-chess gibberish, which is kinda expected since it is a general English language scorer :stuck_out_tongue:).


Please first reproduce with deepspeech binaries, not third-party examples.

  1. Make everything reproducible. So, record with a mic and feed the recording. That way, you can try different stuff on the same chunk.

  2. Given you are using the latest released model, what is the output without the scorer argument? That way you can see the pure output of the neural net (see the command sketch after this list). If that isn’t at least roughly in the right direction, a good LM won’t save you :slight_smile:

  3. What arguments did you use to produce the LM, standard or modified?
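For point 2, a minimal run of the deepspeech binary without a scorer could look like this (the model path is a placeholder):

    $ deepspeech --model models/deepspeech-0.7.0-models.pbmm --audio recording.wav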

Thanks for your replies!

@lissyx I was unable to compile the binaries earlier. I will try again and reply here with the results soon.

@othiele

  1. Thanks, I am already using wav files as far as possible.

  2. I ran the mic vad streaming example without the scorer argument (python mic_vad_streaming.py -v 0 -m ~/DeepSpeech/models/models.pbmm), and it seems to produce at least acoustically matching results. For example:

    • (what I said) => (transcription)
    • king goes out => ging gos out
    • queen a takes b four => queen ag igs b ford
    • queen king night => gueen king niht
    • the quick brown fox jumped over the lazy dog => the quic trowne fox jumped over the lazy dogk

    but as you can see the vocab is all over the place :upside_down_face:

  3. I used the following commands:

    $ python generate_lm.py --input_txt ~/pat/to/in.transcript --output_dir . --kenlm_bins ~/path/to/kenlm/build/bin --arpa_order 4 --max_arpa_memory "90%" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --top_k 50000 --arpa_prune "0" --discount_fallback
    Converting to lowercase and counting word occurrences ...
    | |                   #                                                             | 200451 Elapsed Time: 0:00:01
    
    Saving top 50000 words ...
    
    Calculating word statistics ...
      Your text file has 919303 words in total
      It has 73 unique words
      Your top-50000 words are 100.0000 percent of all words
      Your most common word "takes" occurred 66816 times
      The least common word in your top-k is "abort" with 1 times
      The first word with 2 occurrences is "game" at place 69
    
    Creating ARPA file ...
    === 1/5 Counting and sorting n-grams ===
    Reading /home/gt/otherrepos/DeepSpeech/data/lm/lower.txt.gz
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Unigram tokens 919303 types 76
    === 2/5 Calculating and sorting adjusted counts ===
    Chain sizes: 1:912 2:1264540416 3:2371013376 4:3793621504
    Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
    Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
    Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
    Substituting fallback discounts for order 3: D1=0.5 D2=1 D3+=1.5
    Statistics:
    1 76 D1=0.5 D2=1 D3+=1.5
    2 1263 D1=0.5 D2=1 D3+=1.5
    3 19067 D1=0.5 D2=1 D3+=1.5
    4 114691 D1=0.5 D2=1 D3+=1.5
    Memory estimate for binary LM:
    type      kB
    probing 2494 assuming -p 1.5
    probing 2613 assuming -r models -p 1.5
    trie     749 without quantization
    trie     315 assuming -q 8 -b 8 quantization 
    trie     731 assuming -a 22 array pointer compression
    trie     297 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
    === 3/5 Calculating and sorting initial probabilities ===
    Chain sizes: 1:912 2:20208 3:381340 4:2752584
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ####################################################################################################
    === 4/5 Calculating and writing order-interpolated probabilities ===
    Chain sizes: 1:912 2:20208 3:381340 4:2752584
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ####################################################################################################
    === 5/5 Writing ARPA model ===
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Name:lmplz	VmPeak:7446612 kB	VmRSS:9560 kB	RSSMax:1455416 kB	user:0.372354	sys:0.416396	CPU:0.788805	real:0.817813
    
    Filtering ARPA file using vocabulary of top-k words ...
    Reading ./lm.arpa
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    
    Building lm.binary ...
    Reading ./lm_filtered.arpa
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Identifying n-grams omitted by SRI
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Quantizing
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Writing trie
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    SUCCESS
    

    and then this:

    $ python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-50000.txt --package kenlm.scorer --default_alpha 1.234 --default_beta 1.012
    73 unique words read from vocabulary file.
    Doesn't look like a character based model.
    Using detected UTF-8 mode: False
    Package created in kenlm.scorer
    swig/python detected a memory leak of type 'Alphabet *', no destructor found.
    

Do you have an accent? The standard model works best for American English. Try speaking slowly and clearly.

Run the generate_lm.py steps yourself: follow these steps, but leave out the pruning and the a and b bits, as these lose information along the way. You’ll still need --discount_fallback, though, for so few words.
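If you want to skip the quantization entirely, one option (a sketch; paths are placeholders) is to run KenLM’s build_binary on the filtered ARPA yourself, without the -a/-q/-b flags:

    $ # build the trie binary LM without quantization or pointer compression
    $ ~/path/to/kenlm/build/bin/build_binary trie lm_filtered.arpa lm.binary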

Ahh, I finally got it to produce good transcription. For some reason I was running the DeepSpeech-examples; I guess that’s because they were quick to set up.

However, now I pip installed the deepspeech module and ran the same files through it, and voilà, I’m getting better transcription! Command used: deepspeech --model models/models.pbmm --scorer models/myown.scorer --audio audio.wav.

I am still not sure, though, what the problem was with those deepspeech-examples.

I also noticed that passing manually downsampled WAV files into the tool resulted in poorer performance compared to passing the original WAV files. Moreover, when speaking directly into my headset microphone I get poor results, whereas when I speak at a distance from it, the results improve.

However, the scorer is still not accurate. It:

  1. sometimes skips over words even though they were clearly spoken;
  2. still confuses quite a few words because they sound similar;
  3. produces some erratic transcription (confuses and skips words at the same time).

I will try adjusting the scorer parameters as you said @othiele.

Also, I get these messages:

...
Warning: original sample rate (22050) is different than 16000hz. Resampling might produce erratic speech recognition.
Running inference.
<<transcription>>
Inference took 6.212s for 48694.288s audio file.

Notice that it gets the duration of the audio file wrong (48694 seconds is way too long :stuck_out_tongue:) and also shows a warning. Are there any fixes for this specific issue?

Which ones did you use? Just for reference.

Check the output without the scorer to see whether the scorer is the problem.

Try downsampling with ffmpeg, or, to get some clean samples, use Audacity; it exports 16 kHz mono WAV.
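For example, something along these lines (filenames are placeholders):

    $ # convert to 16 kHz mono 16-bit PCM WAV
    $ ffmpeg -i input.wav -ar 16000 -ac 1 -c:a pcm_s16le output_16k.wav
    $ # then verify what the file actually contains (soxi ships with sox)
    $ soxi output_16k.wav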

mic_vad_streaming and vad_transcriber. Found here.

Downsampling with ffmpeg did not improve results. I had 5-6 WAV files; after downsampling I get equal or slightly worse transcription. Moreover, I cannot use Audacity, as this software will run offline on the client’s machine :slight_smile:

Yep, it is the scorer. When running without the scorer, I get longer transcription outputs (i.e., the audio does not get skipped).

I set --arpa_prune "0", but what did you mean by “a and b bits”? Are they the -a and -b for build_binary, or the default_alpha and default_beta for generate_package.py? (The latter arguments are compulsory, though.)


I built the scorer without -a and -b for build_binary for now. It caught some more words, but the transcription results are still not that great :confused:

I am not sure building an offline application is the best usage, considering the current state of DeepSpeech. It is still in development.

But let’s take it one by one.

You had the problem that deepspeech showed totally wrong lengths for your input. Is this resolved by using ffmpeg or Audacity?

Thinking about the bad scorer: @reuben, given that he builds a custom scorer for the 0.7.1 release, what values should he use for default_alpha and default_beta? The ones from the release page, since he used different ones? And could that explain the strange predictions?

Yup, of course. That got resolved.

:frowning: So, what would be the best usage according to you? I read the “DeepSpeech in the wild” topic, where many seem to be using DeepSpeech for, from what I can tell, production environments.

Meaning running on a server where you can check the results from time to time. If you plan on doing that, fine. Then you could use ffmpeg, … to transform the WAVs.

As for default_alpha/beta: if you have a good test set yourself, you could find optimal values with lm_optimizer, as described in the release notes:

Subsequent to this the `lm_optimizer.py` was used with the following parameters:

* `lm_alpha_max` 5
* `lm_beta_max` 5
* `n_trials` 2400
* `test_files` LibriSpeech clean dev corpus.

to determine the optimal `lm_alpha` and `lm_beta` with respect to the LibriSpeech clean dev corpus. This resulted in:

* `lm_alpha` 0.931289039105002
* `lm_beta` 1.1834137581510284

I am not too sure whether this will increase recognition a lot; as with other hyperparameters, sometimes you have to play around a bit :slight_smile:
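A rough sketch of such a run, using the flag names from the notes above (dev.csv, the checkpoint directory, and the scorer path are placeholders; check the exact flags for your DeepSpeech version):

    $ python lm_optimizer.py --test_files dev.csv --checkpoint_dir path/to/checkpoint --scorer_path kenlm.scorer --lm_alpha_max 5 --lm_beta_max 5 --n_trials 2400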

Unfortunately, I do not have a test set :frowning: Creating test sets is too much effort, as this is only a side project for me. So I will have to experiment with the values manually. Thanks, though :slight_smile:

Either way, you’ll need some sort of test set, even if it’s only 100 moves. It doesn’t take more than an hour to build, and you would have some idea of where you stand.
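A DeepSpeech test set is just a CSV pointing at your WAVs; a minimal hand-built one (paths and file sizes are placeholders) would look like this:

    wav_filename,wav_filesize,transcript
    /home/me/clips/move001.wav,103244,queen a takes b four
    /home/me/clips/move002.wav,98412,king goes out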

Ah! Alright, I’ll create a test set and get back with results :slight_smile: