Discuss potential PR related to generate_lm.py

Further to the idea floated here: Adding custom words to language model
I was wondering if I could submit a PR for this? I wanted to sound out the team first to avoid wasting time (e.g. if my proposed approach isn’t optimal), and to check whether there’s anything particular you’d like me to do with it.

There are a few ways this could be handled, but I thought it might be minimally disruptive to simply let the user specify multiple input texts, so the script would cycle over all of them.

I’ve got a basic version hacked together. I keep the CLI parameters the same, but at the end I’ve added one to let you specify a delimiter (which defaults to a comma).

Then all it does is it splits the --input_text up based on the delimiter and creates the vocab and lower.txt files for each input text.
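The splitting step described above could be sketched roughly like this (a minimal illustration, not the actual generate_lm.py code; the --delimiter flag and the per-input loop body are my own hypothetical additions):

```python
import argparse

def parse_args(argv=None):
    # Existing flags stay as they are; only --delimiter is new (illustrative name).
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_txt", required=True,
                        help="One or more corpus paths, joined by --delimiter")
    parser.add_argument("--delimiter", default=",",
                        help="Separator between multiple --input_txt paths")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Split the single CLI string into individual input paths.
    paths = [p for p in args.input_txt.split(args.delimiter) if p]
    for path in paths:
        # Placeholder for the real work: build the vocab and lower.txt
        # files for this particular input text.
        print(f"Processing {path}")
    return paths
```

Backwards compatibility falls out naturally: a single path contains no delimiter, so the list has one element and the script behaves as before.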

Right now I’m using the same top K for all, but if this looks worth pursuing I’d make it so the top K was split similarly to the input text parameter.

With the example I’m testing, not having a per-input top K doesn’t make a big difference, since the second input text is massively shorter than 500k words (so all of its words get included anyway), but I can imagine scenarios where being able to set this per input text would be useful.
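The per-input top K idea could look something like the sketch below (a hypothetical helper of my own, not part of generate_lm.py): split --top_k on the same delimiter, and reuse the last value when fewer top K entries than input texts are given.

```python
def split_top_k(top_k_arg, num_inputs, delimiter=","):
    """Split a delimited top_k string into one value per input text.

    Assumes at least one top_k value is supplied; if fewer values than
    inputs are given, the final value is reused for the remainder.
    """
    values = [int(v) for v in top_k_arg.split(delimiter) if v]
    # Pad with the last value so every input text gets a top_k.
    while len(values) < num_inputs:
        values.append(values[-1])
    return values[:num_inputs]
```

So a user with one big corpus and one small word list could pass e.g. `--top_k 500000,1000`, while the current single-value usage keeps working unchanged.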

The other minor thing I was going to do was check for the KenLM path up front, as it’s annoying to find out you’ve messed up the path only after the vocab has been processed (which takes a bit of time even on a fast PC), and then have it crash :slightly_frowning_face:
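The up-front check could be as simple as the sketch below. The binary names (lmplz, build_binary, filter) are the real KenLM tools that generate_lm.py invokes; the function itself and its error handling are just an illustration:

```python
import os
import sys

def check_kenlm_path(kenlm_bins):
    """Fail fast if the KenLM binaries aren't where --kenlm_bins says,
    instead of crashing after the (slow) vocab pass."""
    for binary in ("lmplz", "build_binary", "filter"):
        path = os.path.join(kenlm_bins, binary)
        if not os.path.isfile(path):
            sys.exit(f"KenLM binary not found: {path}")
```

Called first thing in main(), this turns a late crash into an immediate, readable error message.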

Anything else I should consider?

Any objections if I switch the .format strings to f-strings? (or is that best done separately/not at all?)
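To be clear, the change I have in mind is purely cosmetic, along these lines (made-up message, just for illustration):

```python
name, top_k = "vocab-500k.txt", 500000

# Current style in the script:
old = "Saving {} with top_k {}".format(name, top_k)

# Proposed equivalent f-string:
new = f"Saving {name} with top_k {top_k}"
```

Both render the same string, so the swap carries no behavioural risk, only diff noise.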


Good idea. There are a lot of people who have only a small amount of data and get confused. As the script is currently meant to process all of Wikipedia, it would be great if it could also handle smaller amounts.

I’ve submitted a PR for this here: https://github.com/mozilla/DeepSpeech/pull/3542