I don’t think it is a good idea for the following reasons:
- For languages where there is enough LLM training data:
  - there is already enough real text to select sentences from
  - a better approach to expanding the corpus is to work with specific domains
  - the generated sentences would need to be checked anyway
- For languages where there isn’t enough LLM training data:
  - the sentences still have to be checked, and most of them will be rubbish (we tried this recently with varieties of Nahuatl, and it was a disaster); it would create headaches for reviewers
If you wanted to use GPT to generate sentences and then run them through the normal review process, I don’t see a problem with that. But it seems to me that working on a specific task or application for the larger languages (English, German, Esperanto) would be more productive.