A few years ago the Common Voice team decided against AI-generated sentences in the corpus, and at that time I completely agreed with them. As we all probably know, things have changed a lot in the last few months when it comes to AI, and maybe it is time to reevaluate the situation.
Why should we do this?
I assume we still want to avoid duplicates and keep the goal that every sentence is only recorded once. This means that for a good dataset we need millions of public domain sentences, and they should be close to natural spoken language.
Right now I see a few issues with the sentence corpus:
- The Wiki-Export gives us a lot of unique sentences, but many contain hard-to-pronounce words and technical language. It's good to have this in the mix, but it shouldn't be the majority of the sentences.
- The sentence collector is a great tool, but it takes very long to collect and verify sentences. Unless you have a lot of motivated people, this is too slow.
- Other ideas, such as people donating their chats or their own content, have worked in the past, but they still require a lot of manual labor to filter the sentences, bring them into the right format and verify the error rate.
Could AI be an option?
With GPT-4 and a prompt like "Create 100 natural spoken language sentences on a wide range of different topics, each under 10 words! The sentences should not contain any abbreviations." you can easily create thousands of sentences like these:
Dogs are truly a person's best friend.
I love the smell of fresh rain.
Reading books can transport you anywhere.
The sunset painted the sky orange.
Exercise helps maintain a healthy lifestyle.
Cooking at home can be fun.
Traveling expands your cultural horizons.
I enjoy painting landscapes on weekends.
The ocean waves calmed my soul.
The garden was in full bloom.
Music has the power to heal.
The stars twinkled in the night.
Laughter is the best medicine.
I have a passion for photography.
A warm cup of tea soothes.
The forest was alive with birdsong.
Volunteering brings joy and fulfillment.
She wore a beautiful red dress.
The cake was simply delicious.
The movie kept us on edge.
I played around with the prompt. It creates a wide variety of sentences, even when called very often, and you can also ask it to create the 100 sentences about a certain topic. You could loop through a list of many topics and automate everything using the OpenAI API. From a technical point of view I don't see any problems.
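To make the topic loop a bit more concrete, here is a rough Python sketch of how it might look. It is only an illustration under some assumptions: `generate` is a hypothetical function that sends a prompt to whatever model API you use and returns the raw text answer, and the helper names (`build_prompt`, `clean_lines`, `collect_sentences`) are made up for this example. The sketch also strips list numbering, drops sentences that are too long and removes exact duplicates.

```python
import re
from typing import Callable, Iterable, List

# Prompt wording follows the one quoted above, with a topic slot added.
PROMPT_TEMPLATE = (
    "Create {n} natural spoken language sentences about {topic}, "
    "each under {max_words} words! "
    "The sentences should not contain any abbreviations."
)


def build_prompt(topic: str, n: int = 100, max_words: int = 10) -> str:
    """Fill the prompt template for one topic."""
    return PROMPT_TEMPLATE.format(n=n, topic=topic, max_words=max_words)


def clean_lines(raw: str, max_words: int = 10) -> List[str]:
    """Split a raw model answer into sentences and drop unusable lines."""
    sentences = []
    for line in raw.splitlines():
        # Remove list markers such as "1. ", "2) " or "- " at the line start.
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        # Keep only non-empty sentences with fewer than max_words words.
        if text and len(text.split()) < max_words:
            sentences.append(text)
    return sentences


def collect_sentences(
    topics: Iterable[str],
    generate: Callable[[str], str],  # hypothetical: prompt in, raw text out
    n: int = 100,
    max_words: int = 10,
) -> List[str]:
    """Loop over topics, call the model once per topic, and deduplicate."""
    seen = set()
    result = []
    for topic in topics:
        raw = generate(build_prompt(topic, n, max_words))
        for sentence in clean_lines(raw, max_words):
            key = sentence.casefold()  # case-insensitive duplicate check
            if key not in seen:
                seen.add(key)
                result.append(sentence)
    return result
```

In practice `generate` would wrap the real API call, and you would pass a long topic list such as `["cooking", "weather", "sports"]`. Keeping the API call behind a plain function also makes it easy to test the filtering without spending any credits.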
But should we do this?
On the pro side:
- It could give us huge numbers of natural sentences for many languages
- The sentence structure, the topics and other factors are easily controllable
- It is easy to automate
- It is relatively cheap
- Sentences of fewer than 10 words in an unsorted list are very unlikely to cause any copyright claims
On the con side:
- We could copy the bias of the GPT training data
- Maybe the sentences are less diverse than they appear right now, especially when created in huge numbers
- We don't know the error rate of the sentences yet, especially for languages other than English
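The diversity concern could be checked on a generated batch before anything is added to the corpus. The following is just a minimal sketch of one possible check, not an established Common Voice procedure: the function names and the choice of metrics (share of unique sentences after simple normalization, and the most frequent first words) are my own assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def normalize(sentence: str) -> str:
    """Lowercase, collapse whitespace and drop trailing end punctuation."""
    return " ".join(sentence.casefold().split()).rstrip(".!?")


def diversity_report(sentences: Iterable[str], top: int = 3) -> Dict[str, object]:
    """Return simple diversity indicators for a batch of sentences."""
    normalized = [n for n in (normalize(s) for s in sentences) if n]
    total = len(normalized)
    unique = len(set(normalized))
    # Many sentences sharing one first word hints at repetitive structure.
    openers = Counter(s.split()[0] for s in normalized)
    return {
        "total": total,
        "unique_ratio": unique / total if total else 0.0,
        "top_openers": openers.most_common(top),
    }
```

If the unique ratio drops noticeably as the batch grows, or a handful of opening words dominate, that would suggest the output is less varied than a small sample makes it look.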
Given that most sentence collections are already biased because they only have a few big sources, adding AI-generated sentences looks like an improvement to me. At least it would add some easy-to-read sentences in relevant numbers.
We could start with a small proof of concept. For example, adding 50 000 generated sentences to the 1.6 million English sentences won't cause much damage, but could give us some valuable insights. After that we could expand the experiment to bigger numbers or more languages.
What do you think?