How do I add a single word for my language?

I beg to differ.

I think the main reason people tended to include 5-10 word sentences in the past was that the Language Model in DeepSpeech (and in Coqui) uses 5-gram models.

As I mentioned above, many utterances in our everyday conversations contain fewer than 5 words, and a lot of them are single words. For example, when you are giving commands to a machine, or when you are asking for a specific thing and getting answers…

There is nothing wrong with single words. Yes, you shouldn’t dump the whole vocabulary, but anything conversational will be OK IMHO.

  • Coffee?
  • Yes!

The paradigm has been shifting to edge computing, where Acoustic Models gain importance and simple/short sentences are dominant.

We also recently removed single words from Cantonese as these were super boring.

Nope. I Google-translated them at the time: they were non-conversational and only a few were single words, though a couple of them could have been OK. Also, they were Mandarin…

The OP is asking about a single word. Boring would not be a problem. If you examine them, many language datasets include single-word sentences.

What I would advise, though:

If you are adding many short sentences, you need to counterbalance them with longer ones, e.g. 1000 short ones + 1000 longer sentences from a book. Mix them randomly before posting.
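The balancing and mixing step could be sketched roughly like this in Python. This is only an illustration, not the tooling I actually use: the function names, the `short_limit` threshold of 5 words (borrowed from the 5-gram remark above) and the whitespace-based word count are all assumptions, and word counting like this will not suit languages written without spaces.

```python
import random


def split_by_length(sentences, short_limit=5):
    """Split sentences into (short, long) by whitespace word count.

    short_limit is an assumed threshold: fewer words than this counts as short.
    """
    short = [s for s in sentences if len(s.split()) < short_limit]
    long = [s for s in sentences if len(s.split()) >= short_limit]
    return short, long


def mix_balanced(short, long, seed=None):
    """Take equally many short and long sentences and shuffle them together."""
    rng = random.Random(seed)        # seed only makes the shuffle repeatable
    n = min(len(short), len(long))   # keep both groups the same size
    mixed = rng.sample(short, n) + rng.sample(long, n)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    sentences = [
        "Coffee?",
        "Yes!",
        "Stop.",
        "This longer sentence was taken from a public domain book.",
        "Here is another sentence with clearly more than five words.",
    ]
    short, long = split_by_length(sentences)
    for line in mix_balanced(short, long, seed=42):
        print(line)
```

With 1000 short and 1000 long sentences as input, `mix_balanced` returns all 2000 in random order; if one group is larger, the surplus is left out so the ratio stays 1:1.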

This is what I’ve been doing. I analyze every book/subset for this purpose. You can see my analyses for every resource I added here (they are in Turkish, but the tables should be readable anyway):
