Adding custom words to the language model

Unless people clearly state what they need, it is also hard for us to know what is complicated and what is not.

It comes down to a text file and a few commands to run.

Maybe you could show everyone on YouTube how simple it is to add custom words; that would be good for the project and its users, I’m sure.

Maybe I don’t have the time or the skills to make YouTube videos.

I mean, seriously, please explain to me what is complicated:

  • a text file with sentences covering your language / vocabulary
  • run generate_lm.py to build the language model from it
  • run generate_scorer_package to make it usable by DeepSpeech

Follow the steps to reproduce our scorer: just concatenate your text file to the LibriSpeech one and ensure --top_k does not get rid of it.
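
Concretely, something like the sketch below. This is only a sketch: the file names, paths and hyperparameter values mirror the documented scorer recipe and the release scorer defaults, my_words.txt stands in for your own file of sentences, and generate_scorer_package is the tool built from native_client; adjust everything to your setup.

    # 1. Append your sentences (one per line) to the LibriSpeech normalized corpus.
    zcat librispeech-lm-norm.txt.gz | cat - my_words.txt | gzip > corpus.txt.gz

    # 2. Build the KenLM language model and the top_k word list.
    python3 data/lm/generate_lm.py --input_txt corpus.txt.gz --output_dir . \
        --top_k 500000 --kenlm_bins /path/to/kenlm/build/bin/ \
        --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
        --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

    # 3. Package lm.binary and vocab-500000.txt into a scorer for DeepSpeech.
    ./generate_scorer_package --alphabet data/alphabet.txt --lm lm.binary \
        --vocab vocab-500000.txt --package kenlm.scorer \
        --default_alpha 0.931289039105002 --default_beta 1.1834137581510284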

Maybe generate_lm.py could benefit from a PR that:

  • allows merging in other text files
  • does so without applying top_k filtering to them
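
Purely as a sketch of the idea, nothing like this exists today; --extra_txt is an invented name for what such a PR could add:

    # Hypothetical: --extra_txt is NOT an existing generate_lm.py flag; it only
    # illustrates the proposed behaviour of merging a second file whose words
    # would bypass top_k filtering. The other flags would stay as documented.
    python3 data/lm/generate_lm.py --input_txt librispeech-lm-norm.txt.gz \
        --output_dir . --top_k 500000 --extra_txt my_words.txt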

How do you ensure top_k doesn’t remove it?

You need to remember that you already know the process, what’s possible, and what tools to use. Newcomers have no idea. I don’t know anything about “top_k”, why it would remove my additions, or how to stop that. You went on to say the script needs a PR for various features, but it’s not clear whether those are current features or still to be added. In my case I need to add lunar crater names; from your comments it seems I must rebuild the scorer after appending my words to the existing language model, but without a clear and detailed description I’m uncertain how this will work, which makes me uncertain about proceeding, knowing how much time can be wasted on these things when the documentation is limited.

Sorry, I thought it was clear that this would be a useful feature we would welcome.

python generate_lm.py --help already documents it: https://github.com/mozilla/DeepSpeech/blob/master/data/lm/generate_lm.py#L144-L149

As described, top_k keeps the x most frequent words, so if you add new words they might be dropped if they are not frequent enough.
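
As a rough illustration of the concept only (the real filtering happens in Python inside generate_lm.py, this just shows the idea with standard tools, reusing the file names from the earlier sketch):

    # Conceptually, top_k filtering is just "keep the N most frequent words".
    zcat corpus.txt.gz | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n' \
        | sort | uniq -c | sort -rn | head -n 500000 > top_words.txt

    # After a real generate_lm.py run you can check whether a custom word
    # survived in the vocabulary file it wrote (name depends on --top_k).
    grep -wF "copernicus" vocab-500000.txt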

No, you need to read; this is explicitly what I said: we are used to all of that, and it is sometimes hard for us to distinguish between:

  • underdocumented
  • complicated
  • unclear
  • people not finding the doc
  • etc.

yes

The clear and detailed description is already what we have in the docs.

At some point, however, yes, you might have to go deeper into the source in some cases.

Sorry, but each and every use case is different, and we can’t address all of them at once by ourselves.

It is also difficult to fix problems people don’t report.

Okay, but I’d say it’s best to avoid telling newbies they may have to dig through source code to figure things out, or you’ll lose users. Almost every serious user will need to add custom words, so it’s worth giving that some attention. I’ll now go ahead and rebuild the scorer.

People’s levels vary. The quality of the docs and tooling also improves with feedback.

I insist: please share where you have difficulties, because from here we can’t see any.

It’s about perception: your docs need to be written by a user, not a coder. They read as very abbreviated, assuming users will have the same background knowledge. I’ll post how I get on.

Again, feedback loop. The docs are written by us, so obviously they have their limitations and assumptions.

:smiley: Yeah, catch-22. I will feed back my experiences shortly.

I need to stop my words from being discarded, so can I set line 148 of generate_lm.py to false to disable top_k? Or is there a better method?

In the current state, I don’t think so. Maybe try generating your text file with more repetition of your data so it survives top_k, or maybe you can also adjust the value of top_k?
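
For example (file names as in the earlier sketch; the repetition count is arbitrary):

    # (a) Repeat your sentences so their words rank inside the top_k cut-off
    #     before concatenating them onto the base corpus.
    for i in $(seq 1 100); do cat my_words.txt; done > my_words_repeated.txt

    # (b) Or raise the cut-off itself, e.g. pass --top_k 1000000 instead of
    #     500000 in the generate_lm.py call from the documented recipe.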

So Mozilla let go of all the key programmers, leaving users to sort things out themselves! That’s a very hopeful strategy; it’s more likely, IMO, that new users will walk away. Most potential users can’t afford the time to debug a speech engine.
I can’t just “repeat the words” and hope they don’t get deleted; you can’t code on that basis.
Is there a functional flowchart for these processes, showing how top_k interacts?

I was trying to help you. If you have a grudge against Mozilla, please take it to the people in charge, not to those trying to help you in their free time.

This is not what we are talking about here. You are trying to build something on top of our system. I’m trying to help you, given the current state of the project, achieve that.

However:

  • I’m not in your head, so I don’t know how you think, what your expectations are, or what you do and don’t understand
  • you hijacked an old thread asking for help, but have mostly been asking people to do the work for you
  • I have been trying to help you from the beginning, even though I am not supposed to work on this project anymore
  • so far, you have just complained that “you cannot repeat the words”: please tell me what is difficult about generating a text file by copy/pasting
  • yes, you need to dig into the codebase to understand some things

That’s unfortunate, but:

  • this is a project exposing a library and tooling around it,
  • we try to make it as simple as possible,
  • we can’t cover each and every use case ex nihilo,
  • the feedback loop is a thing,
  • we are not perfect, sorry

No, read the code.

Not sure what you are referring to here.