Inquiry on Scorer Creation --top_k parameter

Hello.

I wanted to start off by saying I have been having amazing results with DeepSpeech. I was able to get the Google Commands dataset from a 48% avg WER down to sub 3% WER and CER with fine-tuning and my own scorer.

I am also training on some air traffic data now and having rather decent results compared to what I expected. Very exciting indeed! Thank you to the team.

I just have a quick question regarding scorer creation (I want to make sure I am making the most of it). I understand most of the parameters except --top_k.

The --top_k parameter in generate_lm.py says: "Use top_k most frequent words for the vocab.txt file. These will be used to filter the ARPA file."

Can someone explain how this impacts the results of the scorer? If I have a small corpus, should I just include all the words, or is there some guidance on how many words I should use here? Any advice would be great!

Thank you.

It’s there to help remove noise from the data by dropping the least frequent words, while trying to keep the same level of accuracy with a smaller scorer.
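To make the effect concrete, here is a minimal sketch of the idea behind a top-k vocabulary cut. It is not the actual generate_lm.py code; the function names `top_k_vocab` and `token_coverage` and the toy corpus are made up for illustration. It counts word frequencies, keeps the k most frequent words, and reports what share of the corpus tokens those words cover.

```python
from collections import Counter


def top_k_vocab(lines, top_k):
    """Return the top_k most frequent words in the corpus (hypothetical helper)."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return [word for word, _ in counter.most_common(top_k)]


def token_coverage(lines, vocab):
    """Fraction of corpus tokens that are still in the vocabulary after the cut."""
    keep = set(vocab)
    tokens = [tok for line in lines for tok in line.lower().split()]
    return sum(tok in keep for tok in tokens) / len(tokens)


# Toy corpus (made up): "agian" is a typo that appears only once.
corpus = [
    "turn left now",
    "turn right now",
    "turn left agian",
]

vocab = top_k_vocab(corpus, top_k=3)
coverage = token_coverage(corpus, vocab)
print(vocab, round(coverage, 2))
```

In this sketch, asking for more words than the corpus contains simply returns every word, so for a small, clean corpus a large top_k keeps everything. One way to pick a value is to look at how quickly the coverage figure flattens out as k grows and check that the words being dropped are mostly typos or one-offs.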
