I'm almost giving up on the project. Feedback from a big contributor (10000 sentences sent, 7000 listened)

Dear @mary thank you for your contributions.

Several of these issues are already being worked on; for example, improving the visibility of Sentence Collector (and Sentence Collector in general) is currently on the roadmap.

In terms of repetitive sentences, this is something that can be taken care of in the text corpus: if there are repetitive sentences, they can be removed. It would help if you let us know which language(s) you are contributing to.

In terms of advertising, we rely on the language communities represented on Common Voice to help us with that. With over 100 different languages participating so far, this is far more than a single project can accomplish on its own, and individual communities know their social media needs better.

We hear your frustration, and would be happy to help you come up with constructive ways to participate. Thank you again for your contributions!



Hey @mary, I’m with @daniel.abzakh on this. This is an open source project which belongs to us, the language communities, and the Mozilla Foundation is a facilitator who helps us. Being open source means we should contribute to it in any way possible.

I’m nearing two years of commitment to this project. When I first came here, I also saw many missing parts and pieces, including what you have mentioned, and some more, which were corrected along the way. You can see some of them here. You can also check the issues and feature requests on GitHub for a larger set.

The project currently has only a single engineer for the application and all the backend, and it is not easy to maintain such a big code base. In such scenarios it is not easy to completely change a feature and/or add new ones without breaking things. But I can see from the PRs on GitHub that it is continuously worked on.

I cannot say anything specific on behalf of the project or about your issues (errors, the language you’ve been contributing to [Arabic?], etc.), but I heard about the following:

  • The community manager position has been open for a while now (since July 2022) and they opened a call, which was closed after a time. I’d like to assume we will have someone to guide us for new campaigns :slight_smile:
  • There was a master’s thesis on CV UI/UX two months ago in which many of us participated. It was about Sentence Collector and its integration into the CV frontend. I hope it will be implemented so that SC comes into view.
  • Your problem with recording the same sentences lies in the fact that the text corpus is not big enough. You need to add new sentences/vocabulary as a community. If it is Arabic, you can check your detailed statistics here, or choose your language if it is not.
  • The infinite-loop error you mentioned might be the pop-up which appears while sending, right? It is a server-side error (from the database) which prevents the same person from recording the same sentence. You should cancel it; the code will not allow it. But you shouldn’t be allowed to record those sentences anyway, which is another unknown bug.

As a volunteer who knows some of this stuff, perhaps I can help if you can be more specific.


Maybe I’m not being clear enough. I don’t want to keep saying the same words! I want to speak words I’ve never spoken before in the project. It’s not just a corpus problem.

The words alone are in the public domain. And a corpus isn’t going to solve that, because I’m going to keep repeating words I’ve already spoken thousands of times or skipping until I get to different words.

Another thing that is irrational is that there is a short time limit for speaking the sentences, and sometimes the text is too long! Sometimes it is impossible to say everything in time.

The main problem is not the repeated sentences; the main problem is sending me words that I have already spoken, for me to speak again. Useless. I want to speak only words that I have never spoken before, to enrich the vocabulary of the dataset.

Increasing the text corpus is trivial, the words alone are in the public domain, it’s just literally taken from a dictionary. And even then, it wouldn’t solve the problem.

I would keep getting words that I have already spoken, that I don’t want to speak anymore, which is annoying. I want to receive only new words that I have never spoken, the project will be more useful, efficient, rational like that. Each user should have the option to receive only words that they have never spoken, to be more useful and not to get bored.

What is the point of receiving words that I have already spoken to speak again? Or keep skipping all the clips until I come up with some rare word I’ve never spoken before? God. How does the project go forward like this?

Most of the common expressions and usual words I have already spoken! Expand the corpus with words from open dictionaries and only send new words to users.

This would accelerate the development and seriousness of this project to a great extent. This should be the main focus of development.

You are free to propose new sentences. In terms of pronouncing words that you have already spoken, this actually makes sense: words are pronounced differently in context, and if you are seeing the words multiple times it may be that they appear in multiple contexts. In any case it is difficult to respond, as you have not given any concrete examples, and it is unclear which language you are working with.

A lot of work for one person. We need to augment the corpus with dictionary words, what’s the problem with doing that? There are no problems with copyright!

And context changes the pronunciation of words in expressions very little. The project needs to be improved; these flaws apply to all languages.

I have already sent emails to NVIDIA and Bill Gates’ corporation, among other partners that finance the project. This problem is very serious and it is not being addressed in the right way. Let’s wait. They need to hire more developers, if that’s the case.

It is always good to have a larger vocabulary, with those words in different sentences/contexts and spoken by different people (gender, age, accent, etc.).

But we need to be clear on some basics:

  • Common Voice is not aware of any language specifics, such as the dictionary’s universe, whether the words in a sentence have been spoken previously, statistics on that, etc.
  • What is recorded is defined by the text corpus of that language, which must be public domain/CC0 and which must be provided by the language communities.
  • It is best to have sentences spoken in the daily language, conversational ones. But as you might know, in any language only a small portion of the whole vocabulary is commonly used in everyday conversation, and the vocabulary changes over time. Some words are not used anymore (e.g. only spoken by the elderly) and some new ones appear (e.g. we didn’t use the phrases “face mask” or “blood clotting” in everyday language until we hit a pandemic).
  • It is not a good idea to dump a dictionary into the text corpus; entries should be valid sentences from conversations. There may be single-word sentences like “Tea?” (in place of “do you want some tea”), but it is better to have longer sentences. The acoustic models rely on sounds following one another, and different word combinations are good for them.
  • Different pronunciations matter. It can be accent differences (e.g. “Tea?” and “Tea…” are spoken differently and mean different things, “do you want tea?” for the first one and “I’m drinking tea…” as an answer to a question), or different pronunciations by different people from all over the world.

Now let us do some simple analysis, assuming some values and doing some calculations:

  • If we aim for a voice-AI model for a language, we will need both an acoustic model and a language model. (Note: Common Voice is here for the acoustic model; you can add a specific or broad language model for your application’s purpose, independent of the vocabulary here on CV.)
  • We aim for <10% WER (word error rate).
  • We have a language with a 100k-word dictionary, but only 50k of these words are in the CV text corpus/voice corpus…
  • With sufficient recordings, you might get WER = 50%, and with a specific language model added on top of it, you might get WER = 30%.

To improve these:

  • You would need more recordings (the duration of the training set is the most important factor, along with diversity).
  • You would need a larger text corpus to achieve that (to increase the vocabulary).
  • You need more people with different demographics (gender, age, accent) to speak these sentences.
  • You can fine-tune your language model with more data for the purpose at hand (e.g. producing subtitles for the news is different from transcribing a conference in the area of medicine).

For better domain-specific data, you would need domain-specific text corpora and recordings though, which would require a more-or-less major change in CV workflows.

An acceptable model can be achieved (say) with 100 h of validated, quality (correctness, diversity, vocabulary, etc.) recordings. Assuming an average recording is 3.6 seconds, 1 hour of data is 1000 recordings, so for 100 h you would need 100k recordings. If each sentence is spoken by two different people on average, that would mean 50k sentences in the text corpus.
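The back-of-envelope numbers above can be written out as a tiny script; the 3.6 s average clip, two speakers per sentence, and the 100 h target are the assumptions from this post, not official CV figures:

```python
# Rough sizing of a 100-hour validated dataset,
# using the assumed values from this post.
AVG_CLIP_SEC = 3.6         # assumed average recording length
SPEAKERS_PER_SENTENCE = 2  # assumed recordings per sentence
TARGET_HOURS = 100         # assumed target for an acceptable model

recordings_per_hour = 3600 / AVG_CLIP_SEC                    # 1000 clips
total_recordings = TARGET_HOURS * recordings_per_hour        # 100k clips
sentences_needed = total_recordings / SPEAKERS_PER_SENTENCE  # 50k sentences

print(int(recordings_per_hour), int(total_recordings), int(sentences_needed))
```

Change the assumptions (longer sentences, more speakers per sentence) and the required corpus size shifts accordingly.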

To get better results:
New sentences to SC (with new vocabulary) + new volunteers => new and more recordings

This is how things work.

A lot of work for one person.

Yep. You need to build a community for your language, some of them will be more interested/dedicated like you, so you might form a core group for social media, events, campaigns etc.


The main goal of a language community is the betterment of the language data (text & voice corpora) on CV, so that a better voice-AI comes out.

It is like steering. You turn left; if too much, you turn a bit right; if you’re slow, you hit the gas.

Currently, CV releases are 3 months apart. I think it is an ideal timing for this workflow:

  • A version comes out, you analyze the data (also taking the previous releases into account)
  • You find what you are lacking, or how much you improved wrt previous version.
  • Plan for the next release (campaigns etc)
  • Go to start

For the analysis part, I prepared two webapps. I think these will help with this crucial part of the workflow for all languages.

For example, the results of v12.0 for my language, Turkish, are here.

From the Text Corpus tab I can see the following:

The total token count is ~39k.
The Turkish dictionary has ~90k entries, but many of them are somewhat old (with Farsi or Arabic roots) and are not spoken much now.
On the other hand, Turkish is an agglutinative language (words are extended with suffixes), so the 39k figure does not represent only root words; it also includes plurals, etc.
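For reference, the kind of token counting behind a figure like ~39k can be sketched as follows; the tokenizer here is deliberately naive (lowercase, runs of word characters) and the three sentences are made up:

```python
from collections import Counter
import re

# Toy stand-in for a language's text corpus; a real analysis
# would stream the full validated-sentences file instead.
sentences = [
    "Çay ister misin?",
    "Çay içiyorum.",
    "Yarın görüşürüz.",
]

def tokens(text):
    # Naive tokenizer: lowercase, keep runs of word characters.
    return re.findall(r"\w+", text.lower())

counts = Counter()
for sentence in sentences:
    counts.update(tokens(sentence))

print(len(counts))    # number of distinct tokens (types)
print(counts["çay"])  # frequency of one token
```

For an agglutinative language you would additionally need stemming to separate root words from their inflected forms, which this sketch does not attempt.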

So I need to work on vocabulary and extend the text-corpus.

If we look at the character or word/sentence distribution:

I see that many sentences are short. We added many short conversational sentences; this is why.

So we need to work for longer sentences.

I added all the books from a famous writer, which became public domain, so a fair amount of rather old vocabulary is also included.

So I need to find CC0 sources for new text-corpus, which are longer sentences and include new vocabulary.

Whenever I find a resource, I analyze it against the current corpora (with offline scripts) and calculate the amount of new vocabulary. I also have similarity filters, with which I filter out very similar sentences.

After collecting a couple of these, I merge them and shuffle them before posting them to Sentence Collector for validation, which is done by the community. From each book, I can get some 3-5k sentences and a few hundred new vocabulary words.
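A minimal sketch of this kind of offline filtering; the similarity threshold, the whitespace tokenization, and the `filter_candidates` helper are all illustrative choices of mine, not the actual scripts:

```python
from difflib import SequenceMatcher

def filter_candidates(existing, candidates, threshold=0.85):
    """Drop candidates too similar to known sentences; report new vocabulary."""
    kept = []
    for cand in candidates:
        too_similar = any(
            SequenceMatcher(None, cand.lower(), known.lower()).ratio() >= threshold
            for known in existing + kept
        )
        if not too_similar:
            kept.append(cand)
    # Naive whitespace vocabulary comparison (punctuation left attached).
    known_vocab = {w for s in existing for w in s.lower().split()}
    new_vocab = {w for s in kept for w in s.lower().split()} - known_vocab
    return kept, new_vocab

existing = ["Do you want some tea?"]
candidates = ["Do you want some tea!", "The weather is lovely today."]
kept, new_vocab = filter_candidates(existing, candidates)
print(kept)       # the near-duplicate is filtered out
print(new_vocab)  # vocabulary the kept sentences would add
```

The pairwise comparison is quadratic, so for large corpora you would bucket or hash sentences first, but the idea is the same: only keep candidates that add variety and new vocabulary.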

I hope this example helps with your language.

There are open dictionaries of almost all languages, why not just take from them and make random sentences?

Are there any words that are not in the public domain? Why do they need to be sentences and not words?

The collection of individual words is also key. That should be an option! And no repetitions…

The text-corpus should be done according to the demands of the project, by an artificial intelligence. People keep putting the same phrases as always, I’m going crazy already. And I’m not going to waste time changing this myself, it’s impossible.

As long as the text-corpus is fed only by people, and as long as there is no option to dictate individual words, without repetitions, I see no point in continuing to strive.

We may not see the utility of adding this in the short term, but it will certainly make a lot of difference in the long run. It will result in a great wealth of vocabulary and dialects.

Please take this into consideration.

I will no longer waste time adding different words in the sentence collector and I will no longer be talking and repeating the same sentences and words. I want to receive new sentences, words I’ve never spoken. That makes me feel useful.

There are already some topics where these points are discussed. The first one was asked by me, for example.

As you will see, these are bad practices: dumping the vocabulary, auto-generating sentences by concatenating words, using a bad AI to generate them, etc.

Also, if you use an AI text generator (such as ChatGPT) you must be sure of the copyright of the output; it MUST be CC0/public domain. GPL etc. is a no-go.

The problem is: if you do not add a quality text corpus, the quality of the dataset will drop, and there is usually no going back. Whenever some text is recorded as audio, it sticks in the dataset and can only be removed with post-processing before training.

That makes me feel useful.

It is OK (and advisable) to write your own sentences looking at a dictionary though.

Please search for more, there are many…

The quality of the text-corpus is already very low by human action, believe me. Many grammatical errors and repeated sentences.

Too manual, too tiring, impossible. The text-corpus should be more complete, it should use more isolated and varied words, automatically. Not only by humans.

In addition, there should be the option for us to speak only what we have never spoken before. I repeat that need.

And I want to understand why I can’t validate the audio clips. Just leaving the account. And sometimes it says the clips are over, even though there is a considerable gap between the uploaded and validated clips.

Nothing makes sense. You need to automate these things and make the project smarter. Demand more developers; I have already done my part and contacted the financiers of the project.

We should not struggle for a problem that is clearly from the project code.

I’m not blaming anyone specifically, but Mozilla.

Writing your own sentences is not the only method, I mentioned it because you seem to be picky about the vocabulary. Some bulk methods include:

  • Books that go to the public domain
  • Your own chats
  • Open a chat room and converse with the community, respecting Sentence Collector rules
  • Use of Wikipedia data through cv-sentence-extractor

Could you please share with us which language we are talking about?

PS: I repeat, I’m also a volunteer like you. And it is not related to coding (only). What you are proposing is a remake of the whole system.

Yes, but we are talking about different things. It remains manual, arduous and inefficient work.

I mentioned this in Book-reading mode (aka "ordered sentences collections") and I still consider this the future of any sentence-collection tool:

Collect the sentences from users speaking up what they are currently reading in their browser (with a dynamic highlighting system part of a browser extension)