I have a very specific use-case vocabulary with only 73 distinct English words. I generated a text file containing all possible legal combinations of those words; it has around 2*10^5 lines and is 4.4 MB in size. I then generated the scorer package from that file. (Using the instructions here)
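For reference, this is roughly what I ran to build the scorer; the paths and hyper-parameters below are placeholders based on the documented defaults, not necessarily my exact values:

```
# Build the KenLM language model from my vocabulary file
# (vocabulary.txt and the kenlm path are placeholders)
python3 data/lm/generate_lm.py \
  --input_txt vocabulary.txt \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie \
  --discount_fallback

# Package it into a .scorer (alpha/beta left at the documented defaults)
python3 data/lm/generate_package.py \
  --alphabet alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package kenlm.scorer \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284
```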
I thought this would be enough, since the acoustic model remains the same (English). I used this scorer combined with the pre-trained v0.7.0 model.pbmm file to run the vad_transcriber example.
However, the results were not good! For example, I said “queen a takes b four” but the output was “horse b to”. I tried --aggressive values from 0 to 3 without success. When I recorded without background noise (a ceiling fan), it produced “rex b four”.
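This is roughly how I am invoking the example (flag names as I remember them from the example's README; the model directory holds the .pbmm and my custom .scorer):

```
# models/ contains deepspeech-0.7.0-models.pbmm and my kenlm.scorer
python3 audioTranscript_cmd.py \
  --aggressive 3 \
  --audio chess_move_16k.wav \
  --model models/
```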
I am recording with a headset microphone at 22 kHz and downsampling to 16 kHz using sox. I speak roughly one word per second, and the words sound clear to me when I listen to the downsampled wav file myself.
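The sox command I use for the downsampling is essentially this (filenames are placeholders):

```
# 22 kHz headset recording -> 16 kHz, mono, 16-bit WAV for DeepSpeech
sox recording_22k.wav -r 16000 -c 1 -b 16 recording_16k.wav
```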
I have also tried the mic_vad_streaming example, and it does not produce good transcriptions either.
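For that one I point it at the same model and scorer, roughly like this (again, flag names from memory):

```
python3 mic_vad_streaming.py \
  -m deepspeech-0.7.0-models.pbmm \
  -s kenlm.scorer \
  -v 3
```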
Is there anything else that needs to be done?
PS: Transcription is worse when using the pre-trained v0.7.0 scorer (it generates some non-chess gibberish, which is kinda expected since it is a general English-language scorer).