I'm almost giving up on the project. Feedback from a big contributor (10000 sentences sent, 7000 listened)

There are already some topics where these are discussed. I started the first one myself, for example.

As you will see, these are bad practices: dumping the vocabulary, auto-generating sentences by concatenating words, using a poor-quality AI to generate them, etc.

Also, if you use an AI text generator (such as ChatGPT), you must be sure about the copyright of the output: it MUST be CC0/public domain. GPL and similar licenses are a no-go.

The problem is: if you do not add a quality text corpus, the quality of the dataset will drop, and there is usually no going back. Once a sentence has been recorded as audio, it stays in the dataset and can only be removed by post-processing before training.
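
To give an idea of what that post-processing can look like, here is a minimal Python sketch. It assumes the clip list is a tab-separated file with `path` and `sentence` columns (an assumption for illustration, so check the layout of the release you actually use). The helper name `drop_rejected` and the sample rows are made up.

```python
import csv
import io


def drop_rejected(rows, rejected_sentences):
    """Return only the rows whose sentence is not in the rejected set."""
    rejected = {s.strip() for s in rejected_sentences}
    return [row for row in rows if row["sentence"].strip() not in rejected]


# Tiny in-memory stand-in for a tab-separated clip list (hypothetical data).
sample_tsv = (
    "path\tsentence\n"
    "clip_1.mp3\tThe cat sat on the mat.\n"
    "clip_2.mp3\tapple banana chair\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Sentences you decided should not be used for training.
kept = drop_rejected(rows, ["apple banana chair"])

for row in kept:
    print(row["path"], row["sentence"])
```

Note that this only hides the bad clips from your own training run. The recordings are still in the published dataset, so everyone else has to do the same filtering.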

That makes me feel useful.

It is OK (and advisable) to write your own sentences while consulting a dictionary, though.

Please search for more; there are many…