Adding custom words to language model

Rob_Christopher · February 10, 2021, 1:42pm

Thanks for the heads up! That also needs to be mentioned in the docs!

Rob_Christopher · February 10, 2021, 1:48pm

I’m going to give up at this point. I don’t have time to set up a Linux box or use an online one; I need an OS-independent way to add custom words, and I just don’t have the time at the moment.

Rob_Christopher · February 10, 2021, 1:55pm

If anyone knows the value ranges to be used with addHotWord, please post! Thanks.

lissyx · February 10, 2021, 2:04pm

Too bad you did not follow, at the very beginning, the guidelines we have documented on Discourse for requesting help; knowing you were using Windows would have helped.

Yes, as @othiele said, training on Windows is not supported, and we are welcoming PRs for that.

That being said, wget exists for PowerShell, just use PowerShell and you have it.

gunzip to get the raw text, cat >> to append. Yes, those are Linux commands, I don’t use Windows, sorry.
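For anyone who wants the same two steps without Linux tools, here is a minimal sketch using only the Python standard library. It assumes the corpus is a gzip-compressed text file and that the custom sentences sit in a plain UTF-8 file, one per line. The function name `build_corpus` and all file names are hypothetical, not part of DeepSpeech.

```python
import gzip
import shutil


def build_corpus(gz_path, custom_path, out_path):
    """Decompress gz_path to out_path, then append the lines of custom_path."""
    # Equivalent of `gunzip`: stream the compressed corpus into a plain file.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Find out whether the decompressed text already ends with a newline,
    # so the first custom sentence does not get glued to the last corpus line.
    last_byte = b""
    with open(out_path, "rb") as f:
        f.seek(0, 2)
        if f.tell() > 0:
            f.seek(-1, 2)
            last_byte = f.read(1)

    # Equivalent of `cat custom.txt >> corpus.txt`: append, skipping blank lines.
    with open(custom_path, "r", encoding="utf-8") as src, \
            open(out_path, "a", encoding="utf-8", newline="\n") as dst:
        if last_byte not in (b"", b"\n"):
            dst.write("\n")
        for line in src:
            line = line.strip()
            if line:
                dst.write(line + "\n")
    return out_path
```

Something like `build_corpus("corpus.txt.gz", "custom.txt", "corpus.txt")` would then produce the text file to feed into the language model generation step. This runs the same way on Windows, macOS and Linux.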

lissyx · February 10, 2021, 2:06pm

I never said I don’t know how to control the word filtering, I said we have no code in place in generate_lm.py to do it.

Some of the people who worked on it are still there, but they are no longer working on DeepSpeech. @reuben and I are doing that in our spare time now.

In the past, I could have taken the time to add your feature quite quickly. Unfortunately, I don’t have the time now.

lissyx · February 10, 2021, 2:07pm

We don’t, it’s application-specific, so you need to experiment with your own use case.

Rob_Christopher · February 10, 2021, 2:12pm

Why use powershell when it can just be downloaded in a browser?

I found a post by JRMeyer about hot-word boosting values from last August that should go in your doc file!

As things are, it would take little effort to improve the user experience with DeepSpeech, but I don’t see that happening with the current politics.

lissyx · February 10, 2021, 2:14pm

You asked how to use wget, I’m helping you.

Thanks for your help, I’m glad I have taken time to try and get you something that works and be rewarded like that.

Rob_Christopher · February 10, 2021, 2:20pm

Sorry, but this is just poor management. Over 5 months ago you were posting with JRMeyer and reuben about addHotWords: you posted that you weren’t sure it “even works”, reuben posted that he was unsure the coefficient could be meaningfully applied, and JRMeyer confidently posted value ranges. And here we are, and nothing has been done to resolve this issue at all. How is that good or useful?

lissyx · February 10, 2021, 2:23pm

I’m going to save the bytes on my LTE backup to do actual work, until my FTTH connection hopefully comes back.

lissyx · February 10, 2021, 2:28pm

Actually, FTTH is back.

So, two months after the layoff, and while we were unable to know if Mozilla would continue DeepSpeech.

Link please? Josh worked on this feature. I’m not sure what discussion you are referring to.

What issue?

JGKK · February 10, 2021, 2:37pm

For smaller, domain-specific language models, I found that boost values in the range of 1-20 gave me sufficient results to greatly improve the recognition of the wanted words. Mostly I stay in the 1-10 range, actually.
Anything higher than that gave me worse results.
Keep in mind that a boost word doesn’t work with a space in it.
But that’s just my personal experience, which is limited to smallish language models.
Do you really need the general language model, though? It really sounds to me like you have a very specific use case.
Wouldn’t it be easier to create a scorer from just the sentences that you actually expect?
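To make that concrete, here is a minimal sketch of applying such boosts. It assumes a model object that exposes `addHotWord(word, boost)`, as the DeepSpeech 0.9 Python `Model` does. The helper name `add_hot_words` and the example boost values are hypothetical and only follow the 1-10 range mentioned above, so they are a starting point for experiments, not recommended settings.

```python
def add_hot_words(model, boosts):
    """Register each word with its boost on `model`.

    `model` is assumed to expose addHotWord(word, boost).
    Returns the list of entries that were skipped.
    """
    skipped = []
    for word, boost in boosts.items():
        cleaned = word.strip()
        # A hot word with a space in it does not work, so skip multi-word
        # entries (and empty ones) instead of passing them on.
        if not cleaned or " " in cleaned:
            skipped.append(word)
            continue
        model.addHotWord(cleaned, float(boost))
    return skipped


# Example usage (commented out: needs the deepspeech package and a model file;
# the path below is a placeholder):
# from deepspeech import Model
# model = Model("path/to/model.pbmm")
# skipped = add_hot_words(model, {"tycho": 5, "schumacher": 5, "lunar crater": 5})
# print(skipped)  # the multi-word entry ends up here
```

Returning the skipped entries rather than raising keeps one bad word from blocking the rest of the list. Multi-word names would have to be split into single words or covered by the scorer itself.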

lissyx · February 10, 2021, 2:32pm

If you are looking for off-the-shelf, plug-and-play, 24/7 support, sorry, but that is not something we can provide yet.

Rob_Christopher · February 10, 2021, 2:38pm

We have a general-purpose 3D scene application, and we use standard language to control it, but certain content also has custom words: crater names, biological words, etc.

Rob_Christopher · February 10, 2021, 2:43pm

Come on, you don’t even know the parameter usage for your own functions! This is just poor management, and why rockets and planes blow up and crash, and maybe why Mozilla bosses gave up, I wonder. Let’s not make excuses, you need to sort this critical function out.

Rob_Christopher · February 10, 2021, 2:45pm

Your posts about the boost value for addHotWords are under the title “enable hot-word boosting #3297”. It’s on your GitHub.

lissyx · February 10, 2021, 2:45pm

github.com

mozilla/DeepSpeech/blob/master/data/lm/generate_lm.py#L38-L45

```python
# Save top-k words
print("\nSaving top {} words ...".format(args.top_k))
top_counter = counter.most_common(args.top_k)
vocab_str = "\n".join(word for word, count in top_counter)
vocab_path = "vocab-{}.txt".format(args.top_k)
vocab_path = os.path.join(args.output_dir, vocab_path)
with open(vocab_path, "w+") as file:
    file.write(vocab_str)
```
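The excerpt above keeps only the `top_k` most common words, so a rare custom word can fall out of the vocabulary even after it has been appended to the corpus. Below is a minimal sketch of one way to force custom words into that list. It is not code from generate_lm.py; the function name `top_k_with_custom_words` is hypothetical. It also assumes the custom words occur in the training text, since adding them to the vocabulary alone gives the model no n-grams for them.

```python
from collections import Counter


def top_k_with_custom_words(counter, top_k, custom_words):
    """Return the top_k most common words, plus any custom words not already kept."""
    vocab = [word for word, _ in counter.most_common(top_k)]
    seen = set(vocab)
    for word in custom_words:
        word = word.strip().lower()
        # Append each missing custom word once, after the frequency-ranked part.
        if word and word not in seen:
            vocab.append(word)
            seen.add(word)
    return vocab


# Toy corpus: "tycho" is too rare to make the top 2 on its own.
counter = Counter("the crater the rim the tycho crater".split())
vocab = top_k_with_custom_words(counter, 2, ["tycho", "the"])
print("\n".join(vocab))
# the
# crater
# tycho
```

The resulting list could be joined with newlines and written out the same way `vocab_str` is in the excerpt.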

lissyx · February 10, 2021, 2:48pm

This is a long PR, I’m not sure I remember all of it.

Rudeness is not going to get you any help, you know. Can we please focus on what is actionable?

Rob_Christopher · February 10, 2021, 2:50pm

Try testing that with words like schrödinger (or schroedinger), schumacher and tycho

lissyx · February 10, 2021, 2:51pm

It does look like you still don’t understand: we are no longer working on this as part of our job.

I currently don’t have much spare time to allocate to this kind of feature, sorry. I burnt myself out over 2020, especially preparing the 1.0 release, which that layoff wrecked, so now I can’t push too much and I physically need rest.