I wanted to start off by saying I have been having amazing results with DeepSpeech. I was able to get the Google Speech Commands dataset from a 48% average WER down to sub-3% WER and CER with fine-tuning and my own scorer.
I am also training on some air traffic data now and having rather decent results compared to what I expected. Very exciting indeed! Thank you to the team.
I just had a quick question regarding scorer creation (I want to make sure I am making the most out of these). I understand most of the parameters except for `--top_k`, which is described as:

"Use top_k most frequent words for the vocab.txt file. These will be used to filter the ARPA file."

Can someone explain how this impacts the resulting scorer? If I have a small corpus, should I just include all the words, or is there some guidance on how many words I should use here? Any advice would be great!
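For context, my current mental model of what the top-k filter does is something like the sketch below: count word frequencies in the corpus and keep only the `k` most common words as the vocabulary. This is just my own illustration (it is not the actual `generate_lm.py` code, and the corpus lines are made up), so please correct me if the real behavior differs:

```python
from collections import Counter

def top_k_vocab(corpus_lines, k):
    """Keep only the k most frequent words, roughly what I understand
    --top_k to do when building vocab.txt (illustration only)."""
    counts = Counter(word for line in corpus_lines for word in line.split())
    return [word for word, _ in counts.most_common(k)]

# Hypothetical air-traffic-style corpus for illustration
corpus = [
    "cleared for takeoff runway two seven",
    "cleared to land runway two seven",
    "hold short of runway two seven",
]
print(top_k_vocab(corpus, 3))  # the three most frequent words
```

If that model is right, a small `k` on a large corpus drops rare words from the vocabulary (and so from the filtered ARPA file), which is why I'm unsure what to pick for a small corpus.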