Confirming data used for training 0.5.1 LM

nmstoker · July 12, 2019, 12:09pm

Hopefully this is an easy / quick question to answer.

Regarding the language model included with the 0.5.1 release, could someone from the team confirm that it was trained with the data / process here: https://github.com/mozilla/DeepSpeech/tree/master/data/lm ?

I just wanted to be sure, as I’m looking at extending the LM with some particular text for my application (e.g. names not present in LibriSpeech) and wanted to know I was starting from the correct base.

@kdavis There was also talk of using only the top 10k-50k words here. Has that been implemented yet, or is it still a work in progress? It seemed like it had potential.
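To make the top-N idea concrete, here is a minimal sketch of how I'd picture picking the most frequent words from a text corpus. This is only my assumption of what it might look like, not the team's actual script: the function name `top_k_vocab` and the tiny in-memory corpus are made up for illustration, and a real run would read something like the LibriSpeech LM text from disk.

```python
from collections import Counter


def top_k_vocab(lines, k):
    """Return the k most frequent whitespace-separated words in lines.

    Hypothetical helper for illustration; words are lowercased before counting.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [word for word, _ in counts.most_common(k)]


# Toy corpus standing in for a real LM training text (assumption).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

vocab = top_k_vocab(corpus, 4)
print(vocab)
```

The resulting word list could then be used to restrict the vocabulary when building the LM, with any application-specific names added back in by hand.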

kdavis · July 12, 2019, 3:18pm

The LM was indeed trained as described here: https://github.com/mozilla/DeepSpeech/tree/master/data/lm

nmstoker · July 12, 2019, 6:01pm

Great. Thanks for confirming.
