Inquiry on Scorer Creation --top_k parameter

Epoetin · August 6, 2020, 7:12pm

Hello.

I wanted to start off by saying I have been having amazing results with DeepSpeech. I was able to get the Google Commands dataset from a 48% avg WER down to sub 3% WER and CER with fine-tuning and my own scorer.

I am also training on some air traffic data now and having rather decent results compared to what I expected. Very exciting indeed! Thank you to the team.

I just had a quick question regarding scorer creation (I want to make sure I am getting the most out of it). I understand most of the parameters except --top_k.

The --top_k parameter in generate_lm.py says: "Use top_k most frequent words for the vocab.txt file. These will be used to filter the ARPA file."

Can someone explain to me how this impacts the results of the scorer? If I have a small corpus, should I just include all the words, or is there some guidance on how many words I should use here? Any advice would be great!

Thank you.

lissyx · August 6, 2020, 7:24pm

It’s there to help remove noise from the data (rare words, typos) and to try to keep the same level of accuracy while producing a smaller scorer.
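
To make that concrete, here is a rough sketch of what a top-k vocabulary cut does to a corpus. This is not the actual generate_lm.py code: the function name `top_k_vocab` and the toy corpus are made up for illustration. It counts word frequencies, keeps the `top_k` most frequent words, and reports what share of all word occurrences those words cover.

```python
from collections import Counter


def top_k_vocab(lines, top_k):
    """Return the top_k most frequent words and the fraction of
    all word occurrences in the corpus that those words cover.

    Hypothetical helper for illustration only.
    """
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())

    total = sum(counter.values())
    most_common = counter.most_common(top_k)

    vocab = [word for word, _ in most_common]
    covered = sum(count for _, count in most_common)
    coverage = covered / total if total else 0.0
    return vocab, coverage


# Toy corpus (made up), standing in for a small domain-specific text file.
corpus = [
    "turn left heading two seven zero",
    "turn right heading zero nine zero",
    "climb and maintain flight level two four zero",
]

unique_words = len({w for line in corpus for w in line.lower().split()})

small_vocab, small_cov = top_k_vocab(corpus, top_k=5)
full_vocab, full_cov = top_k_vocab(corpus, top_k=unique_words)

print(f"unique words: {unique_words}")
print(f"top_k=5 keeps {len(small_vocab)} words, coverage {small_cov:.2%}")
print(f"top_k={unique_words} keeps {len(full_vocab)} words, coverage {full_cov:.2%}")
```

With a small, clean corpus, setting --top_k to at least the number of unique words keeps everything, so nothing is filtered out. On a large or noisy corpus, a lower value trims the long tail of rare words, which is where the size savings come from.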
