I think it’s time to talk about AI-generated sentences again

stergro · March 28, 2023, 3:54pm

A few years ago the Common Voice team decided against AI generated sentences in the corpus, and at that time I completely agreed with them. As we all probably know, things have changed a lot in the last few months when it comes to AI, and maybe it is time to reevaluate the situation.

Why should we do this?

I assume we still want to avoid duplicates and keep the goal that every sentence is recorded only once. This means that for a good dataset we need millions of public domain sentences, and they should be close to natural spoken language.

Right now I see a few issues with the sentence corpus:

The Wiki-Export gives us a lot of unique sentences, but many contain hard to pronounce words and technical language. It’s good to have this in the mix, but it shouldn’t be the majority of the sentences.
The sentence collector is a great tool, but it takes very long to collect and verify sentences. Unless you have a lot of motivated people, this is too slow.
Other ideas, such as having people donate their chats or their content, have worked in the past, but they still require a lot of manual labor to filter the sentences, bring them into the right format and verify the error rate.

Could AI be an option?

With GPT-4 and a prompt like “Create 100 natural spoken language sentences on a wide range of different topics, each under 10 words! The sentences should not contain any abbreviations.” you can easily create thousands of sentences like these:

Dogs are truly a person’s best friend.
I love the smell of fresh rain.
Reading books can transport you anywhere.
The sunset painted the sky orange.
Exercise helps maintain a healthy lifestyle.
Cooking at home can be fun.
Traveling expands your cultural horizons.
I enjoy painting landscapes on weekends.
The ocean waves calmed my soul.
The garden was in full bloom.

Music has the power to heal.
The stars twinkled in the night.
Laughter is the best medicine.
I have a passion for photography.
A warm cup of tea soothes.
The forest was alive with birdsong.
Volunteering brings joy and fulfillment.
She wore a beautiful red dress.
The cake was simply delicious.
The movie kept us on edge.

I played around with the prompt. It creates a wide variety of sentences, even when called very often, and you can also ask it to create the 100 sentences about a certain topic. You could loop through a list of many topics and automate everything using the OpenAI API. From a technical point of view I don’t see any problems.
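To make the "loop through a list of topics" idea concrete, here is a minimal sketch. It assumes a per-topic variant of the prompt above and takes the model call as a parameter, so no particular API client is implied. The names build_prompt, collect_sentences and fake_generate are hypothetical, and fake_generate is only a stand-in for a real model call.

```python
from typing import Callable, Iterable, List

# Per-topic adaptation of the prompt quoted above (an assumption, not the
# exact wording that was used in the experiments).
PROMPT_TEMPLATE = (
    "Create {n} natural spoken language sentences about {topic}, "
    "each under {max_words} words! "
    "The sentences should not contain any abbreviations."
)


def build_prompt(topic: str, n: int = 100, max_words: int = 10) -> str:
    """Fill the prompt template for one topic."""
    return PROMPT_TEMPLATE.format(n=n, topic=topic, max_words=max_words)


def collect_sentences(
    topics: Iterable[str],
    generate: Callable[[str], str],
    max_words: int = 10,
) -> List[str]:
    """Call `generate` once per topic and keep one sentence per reply line."""
    sentences = []
    for topic in topics:
        reply = generate(build_prompt(topic, max_words=max_words))
        for line in reply.splitlines():
            sentence = line.strip()
            # Keep non-empty lines that respect the "under max_words" limit.
            if sentence and len(sentence.split()) < max_words:
                sentences.append(sentence)
    return sentences


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns a canned reply."""
    return (
        "Cooking at home can be fun.\n"
        "\n"
        "This sentence is clearly far too long to satisfy the ten word limit.\n"
    )


if __name__ == "__main__":
    print(collect_sentences(["cooking"], fake_generate))
```

In a real run, generate would wrap whatever API client is used, and the collected sentences would still need to go through review.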

But should we do this?

On the pro side:

It could give us huge numbers of natural sentences for many languages
The sentence structure, the topics and other factors are easily controllable
It is easy to automate
It is relatively cheap
Sentences of under 10 words in an unsorted list are very unlikely to cause any copyright claims

On the con side:

We could copy the bias of the GPT training data
Maybe the sentences are less diverse than they appear right now, especially when created in huge numbers
We don’t know the error rate of the sentences yet, especially for languages other than English

Given that most sentence-collections are already biased because they only have a few big sources, adding AI generated sentences looks like an improvement to me. At least it would add some easy to read sentences in relevant numbers.

We could start with a small proof of concept. For example adding 50 000 generated sentences to the 1.6 million English sentences won’t cause much damage, but could give us some valuable insights. After that we could expand the experiment to bigger numbers or more languages.

What do you think?

Francis_Tyers · March 28, 2023, 5:21pm

I don’t think it is a good idea for the following reasons:

For languages where there is enough training data (for LLMs):
- there is enough text to select real sentences
- a better approach to expanding the corpus is to work with specific domains
- the sentences will need to be checked anyway
For languages where there isn’t enough training data (for LLMs):
- the sentences have to be checked anyway, and mostly they are going to be rubbish (we tried recently with varieties of Nahuatl, it was a disaster); this will cause headaches for reviewers

If you wanted to use GPT to generate sentences and then run them through the normal review process, I don’t see a problem, but it seems to me that working on a specific task/application for larger languages (English, German, Esperanto) would be more productive.

bozden · March 28, 2023, 6:23pm

I concur with this. IMHO all points mentioned in the above two posts are valid and should be evaluated case by case, without any hard rule.

When I got my hands on GPT 3.5, I experimented with this idea and presented it in the Matrix channel. My very first concern at that time was whether there are any license issues: “Who is the license owner of the AI-generated sentences? OpenAI? Me, letting it generate them? Or the whole of humanity who created the content on the Internet (which the training is based on), so public domain?”.

After reading some legal debates about this, I’m inclined towards the last option: it is public domain. But, to be sure, the generated sentences must be checked by a human to prevent licensed material (like Starship Enterprise) or cultural biases, like @Francis_Tyers warns us about.

Thinking of my language (Turkish), it is spoken by 1% of the total world population, but it is the content language of 2.4% of the top 10 million websites (March 2023 values). So, IMO we are good to go.

One of the problems we are facing with current sentence generation techniques is the domain-specific corpora. Everyday volunteers can generate everyday sentences, but for sentences from more diverse areas (medical, history, arts, law, technology, or more fine-detailed compartmentalization) we need groups of volunteers from those areas. On the other hand, validation of these also requires expert volunteers in those areas. This is why I proposed domain-based corpora creation last year, so that medical persons can add / validate / record / validate medical sentences/recordings.

I hope this will be added as a next step this year, with the SC being incorporated into CV. We even have a field for it in the database; we just need to decide on what the domains are.

But, until then, GPT is a good resource. I can easily make it generate sentences in those domains, in layman’s terms, without using Latin counterparts, not like the ones we get from Wikipedia for example.

I’m all for it…

cjbaker · March 28, 2023, 7:05pm

I think the idea has some potential. I’ve often noticed that humans making up “random” sentence prompts tend towards a very particular style and form, and I’m not surprised that your GPT-4 example was able to capture it. Humans have similar difficulties when they try to make up random numbers. A few such qualities that I notice in your example sentences are that they’re all declarative (no questions), have simple syntax (no relative clauses etc.), and are semantically predictable. Unfortunately, this sort of bias can influence the training of models such as those used in ASR, as many potential applications will not match the training domain.

You might have better luck just having a language model write the continuation of an existing text, rather than use one of these question-answering front ends. Maybe you could still enforce “no abbreviations” etc. by showing abbreviations written out in the existing text.

A problem I’ve seen before in the CommonVoice prompts is hundreds of sentences which seem to be generated from templates, for example in zh-CN the 1000+ sentences with “X为Y属下的一个种” “X is a species of the genus Y” and many others (I suspect these come from Wikipedia stubs), or French which has thousands of street addresses with numbers. I keep meaning to try measuring the perplexity of some CommonVoice prompts using a language model, I’ll see if I can get this going. This would enable us to measure how predictable a given prompt is, given the whole set of prompts for that language.
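As a rough illustration of the perplexity idea, here is a minimal sketch using an add-one smoothed bigram model trained on the prompt set itself. This is an assumption for illustration only: a real measurement would more likely use a proper language model and a held-out or leave-one-out setup, and whitespace tokenization would need replacing (for example with characters) for zh-CN. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(sentence: str) -> List[str]:
    """Lowercase, split on whitespace and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram_counts(sentences: Iterable[str]) -> Tuple[Counter, Counter, int]:
    """Count contexts and bigrams over the whole prompt set."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        contexts.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence: str, contexts: Counter, bigrams: Counter, vocab_size: int) -> float:
    """Per-token perplexity of one sentence under the add-one smoothed model."""
    tokens = tokenize(sentence)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


if __name__ == "__main__":
    prompts = [
        "Felis is a species of the genus Alpha",
        "Canis is a species of the genus Beta",
        "Ursus is a species of the genus Gamma",
        "The movie kept us on edge",
    ]
    model = train_bigram_counts(prompts)
    for prompt in prompts:
        print(round(perplexity(prompt, *model), 2), prompt)
```

Lower values mean a prompt is more predictable given the rest of the set, so large groups of templated sentences should stand out with low perplexity.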

Regarding low-resource languages, we would probably need to train or finetune our own models for this purpose, rather than attempting to convince an existing model using prompts to write output in a particular language.

bozden · March 28, 2023, 7:40pm

I noticed that in my initial trials and asked GPT to generate conversations between a customer and a server in several scenarios. These came out with many questions and answers and they were conversational, as preferred by CV.

I also asked it to produce sentences with min. 5 words (max was 14, the default), as we need to eliminate already existing sentences. We added many such shorter sentences in the past. Longer sentences are needed in our corpora…
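The length limits and the removal of already existing sentences could be applied after generation with something like the following minimal sketch. The 5 and 14 word limits are the ones mentioned above; comparing sentences after case folding and whitespace normalization is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.casefold().split())


def filter_candidates(
    candidates: Iterable[str],
    existing: Iterable[str],
    min_words: int = 5,
    max_words: int = 14,
) -> List[str]:
    """Keep candidates within the word limits that are not already known."""
    seen = {normalize(s) for s in existing}
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        word_count = len(key.split())
        if word_count < min_words or word_count > max_words:
            continue
        if key in seen:
            continue
        seen.add(key)  # also drops duplicates inside the candidate list
        kept.append(candidate.strip())
    return kept


if __name__ == "__main__":
    existing = ["Could I have the menu, please?"]
    candidates = [
        "Could I have the  menu, please?",
        "Thank you.",
        "Would you like something to drink with that?",
    ]
    print(filter_candidates(candidates, existing))
```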

bozden · March 28, 2023, 7:42pm

But I suspect these could come from someone trying to add domain-specific corpora for his/her use case…

stergro · March 28, 2023, 9:22pm

I think this could be a good first step and would already give us more options. @jesslynnrose Do you think it would be an issue if we wrote an AI model in the source field in the sentence collector?

kathyreid · March 28, 2023, 11:07pm

I strongly agree with @Francis_Tyers on this question.

As I’ve been analysing bias in voice datasets, one of the recurring themes is that there is a lot of bias around Named Entities - people, places and products. They occur infrequently in the text prompts and in many cases they are not represented in language models.

Having sentences around specific domains is one way to combat this bias - because Named Entities will be specific to categories like Medicine, Science, Politics, Sport, Quick Service Restaurants and so on.

LLMs like GPT-whatever are trained on public web data - such as Wikipedia, and other public websites - companies like OpenAI are cagey about exactly what data their LLMs are trained on. However, this reproduces the bias of the public web - and means that less-frequently-occurring Named Entities are going to be poorly represented in LLM data.

A case in point:

I live in the city of Geelong, on the unceded land of the Indigenous Waddawurrung people of the Kulin Nation. The name Geelong comes from the Waddawurrung word ‘Djilang’, meaning cliff.

The Named Entities Geelong, Waddawurrung, Kulin Nation and Djilang are rarely, if ever, correctly recognised by STT systems. They don’t occur frequently - if at all - in training data. However, Named Entities like New York or Bawston, sorry Boston, do. So they are better recognised.

We need to be collecting the data that isn’t out there. The people, the products, the places - that are important to the language communities we serve.

kathyreid · March 28, 2023, 11:26pm

There’s also another major point which I think we’re glossing (pun intended) over.

The data upon which LLMs are trained may be openly licensed, but we don’t know. There are major legal debates at the moment around whether simply hoovering up all the public web data and training on it is permissible.

If Common Voice starts to use LLM-generated prompts, the provenance of those prompts becomes questionable. At the moment, all CV prompts are CC-0, and there is at least some level of assurance over this. This means that the speech data, which is CC-0 licensed, has a “chain of provenance”.

If CV uses LLM-generated prompts, upon which speech data is gathered, and the licensing of LLM-generated text is called into question, this brings into question the licensing of the speech data which is elicited from a sentence.

From a licensing perspective, at least for OpenAI, you own the output that is generated from an LLM (other providers may vary).

(a) Your Content. You may provide input to the Services (“Input”), 
and receive output generated and returned by the Services 
based on the Input (“Output”). 

Input and Output are collectively “Content.” 
As between the parties and to the extent permitted by applicable law, 
you own all Input. Subject to your compliance with these Terms, 
OpenAI hereby assigns to you all its right, title and interest in and to Output. 
This means you can use Content for any purpose,
 including commercial purposes such as sale or publication, 
if you comply with these Terms.

 OpenAI may use Content to provide and maintain the Services,
 comply with applicable law, and enforce our policies.
 You are responsible for Content, 
including for ensuring that it does not violate 
any applicable law or these Terms.

I want to complicate this further.

The funding situation for LLM companies is complex. OpenAI is 49% owned by Microsoft. Google owns Bard. Google has a $USD 300 million stake in Anthropic, makers of Claude. NVIDIA owns Megatron, and for openness and transparency, we should state that NVIDIA has made a $USD 1.5 million contribution to Common Voice.

Are there commercial incentives for the text that is generated by LLMs? For example, let’s say that Company X has an LLM model. Will Company X try to drop Named Entities associated with Company X more frequently into text produced by that LLM?

Let’s say that Company X makes “Pigeon Widgets”, a type of toy.

Would stories or narratives generated by Company X’s LLM feature Pigeon Widgets more frequently? And as a consequence, would the tokens Pigeon Widgets appear more frequently in prompt text? And then in speech data? And then in language models? Such that the utterance Pigeon Widgets is very accurately predicted in speech?

It sounds far-fetched, but it’s a consequence of how scales of bias interlock and combine in the data and machine learning lifecycle.

irvin · March 29, 2023, 12:35pm

I’m also considering the approach of generating sentences with GPT-4.

I kept a list of “missing chars” in the zh-TW Common Voice corpus (about 3011 Chinese chars, 33% of pronunciations are missing currently), and now I can easily close the gap by asking the AI to “generate sentences that contain the following words”. It’s definitely an improvement in diversity if we evaluate with such metrics.
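A minimal sketch of how such a gap could be tracked, assuming coverage is measured per character only (the pronunciation-level check mentioned above is not modelled here). The function names and the tiny example data are hypothetical.

```python
from typing import Iterable, List


def missing_characters(target_chars: Iterable[str], corpus: Iterable[str]) -> List[str]:
    """Return target characters that never occur in the corpus, in input order."""
    seen = set()
    for sentence in corpus:
        seen.update(sentence)
    missing = []
    for ch in target_chars:
        if ch not in seen and ch not in missing:
            missing.append(ch)
    return missing


def sentences_closing_gap(candidates: Iterable[str], missing: Iterable[str]) -> List[str]:
    """Keep candidate sentences that contain at least one missing character."""
    missing_set = set(missing)
    return [s for s in candidates if missing_set & set(s)]


if __name__ == "__main__":
    gap = missing_characters("天地玄黃", ["天氣很好", "土地"])
    print(gap)
    print(sentences_closing_gap(["玄關很乾淨", "今天下雨"], gap))
```

The missing list can be fed into the generation prompt in batches, and the second function shows which generated sentences actually help before they go to manual review.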

The generated sentences are considered non-copyrighted in our local legal framework as well, and we will still do manual review as we did for other sources. I don’t think that could cause any problems atm.

bozden · March 29, 2023, 7:23pm

Very good points @kathyreid. And thank you for sharing GPT output’s legal status.

Just for the sake of discussion:

As someone against big tech owning closed-source AIs, I have read a lot about it in the past months. I’m with you about the general risks and I’m all against an unsupervised AGI.
On the other hand, the bias on CV datasets is much higher than anything which might come from GPT. With the current limitations on sentence collection, most languages cannot go further in this.
I already used most of the public domain material I have, and the next writer whose work enters the public domain does so in 2024. My attempts to include other sources like national assembly proceedings bounced back from Mozilla; I even got a legal letter from the parliament for it.
According to statistics, I could only cover a small percentage of the whole vocabulary. And it is not possible to generate the remaining by hand. Even if they are generated, they will be pruned by LM’s (e.g. KenLM) as the sources will not include them and the performance will be nearly the same - unless you fine-tune for that domain.
I think using GPT outputs as they are is not a good solution. GPT can provide some input but you have to work on that (didn’t pay for v4 yet). I see them as assistants, not solution providers. For example, I asked GPT 3.5 to give me the least used words in Turkish, and it gave a half-correct answer. I can use this output to generate sentences for example.

One other way of thinking is as follows (ref. the discussions on giving rights to sentient AIs): We are all biological machines. I learn from public sources and generate some sentences. GPT does the same - one might see it as a child, who might say some gibberish (corrected by adults)…
I think the lesser-used words in languages are mostly domain-specific, like the location names you mention. For these, I think the best way is to add domain-specific corpora to CV.

To summarize: Using GPT will be the solution for most of us. You spoke of high resourced languages and low resourced ones, but not the ones in-between, which are the majority here.

jesslynnrose · March 30, 2023, 11:20am

I’m really hesitant around AI gen sentences for many of the reasons that @Francis_Tyers and @kathyreid have already covered.

I’m also concerned about the ethics of whole-web data scraping. Not knowing the provenance of the data collected to build and refine these models means that we could be drawing on (and reassembling) not just work with yet-to-emerge legal issues like copyright, but also works that the original writers never intended to have reused in this way.

I think it’s definitely something to keep an eye on, but I would love to stick with text data that has a clear provenance while the space continues to emerge.