Discuss potential PR related to generate_lm.py

nmstoker · February 22, 2021, 12:20am

Further to the idea floated here: Adding custom words to language model
I was wondering if I could submit a PR for this? I wanted to sound out the team first to avoid wasting time (e.g. if my proposed approach isn’t optimal) and to check whether there’s anything in particular you’d like me to do with it.

There are a few ways this could be handled, but I thought it might be minimally disruptive if I simply let the user specify multiple input texts, so the script would cycle over all of them.

I’ve got a basic version hacked together. I keep the CLI parameters the same, but at the end I’ve added one to let you specify a delimiter (which defaults to a comma).

Then all it does is split the --input_text value on the delimiter and create the vocab and lower.txt files for each input text.
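
To make the idea concrete, here is a minimal sketch of the splitting step. It is not the code from my hacked version: the helper names `split_input_paths` and `count_words` and the example file names are made up for illustration, and the word counting is only a stand-in for what the script does per input text.

```python
from collections import Counter


def split_input_paths(input_text, delimiter=","):
    """Split the raw --input_text value into a list of clean paths.

    Hypothetical helper: surrounding whitespace is stripped and empty
    entries (e.g. from a trailing delimiter) are dropped.
    """
    return [p.strip() for p in input_text.split(delimiter) if p.strip()]


def count_words(lines):
    """Lowercase each line and count its whitespace-separated words."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return counter


if __name__ == "__main__":
    # Example file names are placeholders, not files shipped with DeepSpeech.
    for path in split_input_paths("wiki.txt, custom_words.txt"):
        print("would build vocab and lower.txt for:", path)
```

With the default delimiter, a single path still works unchanged, so existing invocations would behave as before.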

Right now I’m using the same top K for all, but if this looks worth pursuing I’d make it so the top K was split similarly to the input text parameter.

With the example I’m testing, it doesn’t make a big difference that the top K isn’t input-specific: the second input text is massively shorter than the 500k top K, so all of its words get included. But I’m thinking there could be scenarios where it’s good to be able to set this per input text.
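
A rough sketch of how a per-input top K could be resolved, assuming the top K value is split with the same delimiter as the input texts. The function name `resolve_top_k` is hypothetical, and letting a single value apply to every input is just one possible design.

```python
from collections import Counter


def resolve_top_k(top_k_arg, n_inputs, delimiter=","):
    """Return one top-K value per input text.

    Hypothetical helper: a single value is reused for every input,
    otherwise the number of values must match the number of inputs.
    """
    values = [int(v) for v in str(top_k_arg).split(delimiter) if v.strip()]
    if len(values) == 1:
        return values * n_inputs
    if len(values) != n_inputs:
        raise ValueError(
            f"expected 1 or {n_inputs} top-K values, got {len(values)}"
        )
    return values


if __name__ == "__main__":
    print(resolve_top_k("500000", 2))
    print(resolve_top_k("500000,1000", 2))

    # A top K larger than the vocabulary simply keeps every word,
    # which is why a shared 500k makes no difference for a short text.
    small_vocab = Counter(["hello", "world", "hello"])
    print(small_vocab.most_common(500000))
```

Reusing a single value keeps the current one-input behaviour intact, while a mismatched count fails early instead of silently guessing.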

The other minor thing I was going to do was have it check the KenLM path up front. It’s annoying to find you’ve messed up the path only after the script has processed the vocab, which takes a bit of time even on a fast PC, and then crashes.
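
Something along these lines is what I have in mind for the up-front check. It is only a sketch: the function names are made up, and the list of binaries (`lmplz`, `filter`, `build_binary`) is my assumption about what needs to be present in the KenLM directory.

```python
import os

# Assumed set of KenLM binaries needed later in the run.
KENLM_BINARIES = ("lmplz", "filter", "build_binary")


def missing_kenlm_binaries(kenlm_bins, names=KENLM_BINARIES):
    """Return the names of the binaries that are not files in kenlm_bins."""
    return [
        name
        for name in names
        if not os.path.isfile(os.path.join(kenlm_bins, name))
    ]


def check_kenlm_path(kenlm_bins):
    """Exit with a clear message before any slow vocab processing starts."""
    missing = missing_kenlm_binaries(kenlm_bins)
    if missing:
        raise SystemExit(
            f"KenLM binaries not found in {kenlm_bins!r}: {', '.join(missing)}"
        )
```

Calling `check_kenlm_path` right after argument parsing would mean a bad path fails in well under a second rather than after the vocab pass.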

Anything else I should consider?

Any objections if I switch the .format strings to f-strings? (or is that best done separately/not at all?)

othiele · February 22, 2021, 12:33pm

Good idea. A lot of people have just a small amount of data and get confused. As the script is currently meant to process all of Wikipedia, it would be great if it could handle smaller amounts.

nmstoker · February 25, 2021, 2:46pm

I’ve submitted a PR for this here: https://github.com/mozilla/DeepSpeech/pull/3542
