Explicitly forbidding/limiting TTS usage?

I just posted the following issue:

I think it is very important…

1 Like

This is a complex issue, and I want to use this post to pull out some of the threads of the issue. In summary, there have been some shifts in the broader environment which change the affordances of the Mozilla Common Voice dataset itself. That is, the data is the same, but how it operates in the world - what can be done with it - has changed. The change in affordances is what is making the community think differently about CV’s licensing arrangements.

Here are some of the changes in affordance:

Supply side changes

Reduction and change in data requirements as TTS zero-shot learning algorithms advance

Traditionally, the “best” TTS voices required a large amount of specially-sourced and recorded audio data: clean, from the same speaker, with strong phonetic coverage, strong n-gram coverage (that is, phonemes that occur in sequence), and a range of styles, preferably labelled. Good examples of TTS datasets are Keith Ito’s LJ Speech, which is drawn directly from LibriVox recordings, and LibriTTS, which is derived from the LibriSpeech materials, which are themselves derived from LibriVox - and LibriVox was never intended as either an STT or a TTS dataset.

With advances in zero-shot learning algorithms, such as VALL-E from Microsoft, which was released in January, the requirements to train TTS models have changed; less data is needed. Instead of training a TTS model on a single speaker, as is the approach for most modern TTS algorithms, such as Tacotron or Flowtron, VALL-E first uses a generative approach, using the Libri-Light dataset (60k hours) - which is derived from LibriVox as well. It then uses a transfer learning approach, from what I can tell, on unseen speakers, to “train” a model for their specific voice.

That is, we no longer need tens or hundreds of hours of a single speaker’s speech; we only need a few seconds and transfer learning does the rest (because attention is all you need!).

It’s easier to train models because we have platforms like Hugging Face and Colab, and maturity in libraries like PyTorch and fastai

Another factor that has changed the affordances of the Common Voice dataset is the wide availability of machine learning platforms such as Hugging Face (HF) and Colab, as well as maturity in libraries like PyTorch and fastai. You no longer need to have a PhD in machine learning to be able to train a generative model. Sure, you might need to pay a bit for GPU compute time, but the barrier to entry for training generative speech models has dropped.

Moreover, we have better documentation and resources for how to use these platforms and libraries, which further reduces the barrier to entry.

Wide availability of TTS algorithms

Many TTS algorithms are widely available on code-sharing platforms, such as GitHub - Tacotron, Flowtron, Coqui TTS, VITS (but not VALL-E, whose GitHub is empty). This further reduces the barrier to entry to creating generative speech models.

Demand side changes

Industry demand

The above points cover the supply side changes. There are also demand side changes.

We are seeing more demand from industry to provide synthetic voices - for the metaverse, for avatar projects, for speech-enabled devices like car consoles, appliances such as Home Assistant, and for voice-overs for video platforms like TikTok and Reels. We want more voices.

Royalties

Industry also wants to use the voice royalty-free, without paying a fee each time the voice is used. If you use a cloud service, like Amazon Polly, then you pay for each API call. Coqui TTS charges you based on how many minutes of voice you synthesize. A royalty-free voice, where you don’t have to pay for the data that is used to create the voice, is very attractive.

So, what can we do about it?

This situation is further complicated by the existing licensing of Common Voice - CC0 or public domain. This allows the dataset to be re-hosted, irrespective of whether Mozilla wants this - such as has been done by Hugging Face. HF does enforce some restrictions - such as having to provide your email address - but in reality there’s no easy way for them to enforce the restriction of “you must refrain from individually identifying people in the dataset” - they’re not going to want to play this enforcement role - and neither is Mozilla Foundation.

The public domain licensing of Common Voice data means that it is difficult to restrict how that data is used, for example by explicitly forbidding TTS usage.

So, if we want to restrict how Common Voice data is used, we need to have more restrictions around it.

In my view, CV needs to be relicensed as a data trust.

A data trust

A data trust provides stewardship of data, so that the data is used only in accordance with the wishes of the people who have contributed to the dataset. This would be cumbersome to do given the tens of thousands of people who have contributed to Common Voice, but a data trust could be used with a permission model - such as “I allow my data to be used for STT” or “I allow my data to be used for TTS”.

Then the data consumer - the party using the data - is required to comply with the terms of the data trust, and this is enforced (somehow).
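To make the permission model a bit more concrete, here is a minimal Python sketch of how per-contributor consent could gate which clips are released for a given purpose. Everything in it is hypothetical (the `Purpose` enum, the `Contributor` and `Clip` classes, and the `clips_for` helper are made up for illustration and are not part of the Common Voice platform), and it only covers the technical filter, not the legal enforcement.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Purpose(Enum):
    """Hypothetical purposes a contributor can opt in to."""
    STT = "stt"
    TTS = "tts"


@dataclass
class Contributor:
    client_id: str
    allowed: set[Purpose] = field(default_factory=set)


@dataclass
class Clip:
    path: str
    client_id: str


def clips_for(purpose, clips, contributors):
    """Return only the clips whose contributor has opted in to `purpose`."""
    consent = {c.client_id: c.allowed for c in contributors}
    # Contributors with no recorded consent are denied by default.
    return [clip for clip in clips if purpose in consent.get(clip.client_id, set())]


# Illustrative data only.
contributors = [
    Contributor("spk_a", {Purpose.STT}),
    Contributor("spk_b", {Purpose.STT, Purpose.TTS}),
]
clips = [Clip("a1.mp3", "spk_a"), Clip("b1.mp3", "spk_b"), Clip("c1.mp3", "spk_c")]

stt_paths = [c.path for c in clips_for(Purpose.STT, clips, contributors)]
tts_paths = [c.path for c in clips_for(Purpose.TTS, clips, contributors)]
```

The deny-by-default choice matters here: a contributor who never stated a preference (like `spk_c` above) ends up in neither release.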

This is a more intensive approach, and requires a different engineering approach to the platform, but as external changes influence the affordances of the data itself, it’s the only way I see to protect the rights of data contributors to Common Voice.

So, that’s a very long post with a lot of thoughts, but the key message is this: the external environment has changed, and it’s enabled people to do different things with the Common Voice dataset; things it wasn’t intended to be used for. If we don’t want people to use the data in unintended ways, then we need to think differently about how the CV dataset is made available longer-term.

2 Likes

Thank you @kathyreid, really! Great write-up!

I only have a simple & naïve reasoning.

When I started with Common Voice and called on people to participate, it was safe wrt TTS, because TTS needed studio environments and hours of data and the other qualities that you mentioned. CV emerged with DeepSpeech and it was intended for STT.

Now, even my family members and friends (mostly women) have multiple hours of recordings that became usable for TTS, to be easily embedded in “nasty appliances”.

Until now, I’ve been very supportive of the CC0 licensing, as our voices will only become small bumps in a model’s parameters.

Also, when I was asked about biometric data security, I’ve been showing the hash ids and telling people that they cannot be identified as long as they don’t enter their real names and have them shown on the hall-of-fame tables. These assurances are not valid anymore. While we are talking about the ethics of gender categorization models, our “democratized voices” can be used for other purposes.

In pace with these advances, the related national legislation is also tightening and will continue to do so. We can only expect Mozilla to do the same.

One proposed solution was removal of the client_id field, but that would prevent researchers from analyzing, re-splitting, etc. the data, e.g. for bias prevention.
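To illustrate why researchers rely on `client_id`, here is a rough Python sketch of a speaker-disjoint re-split, where all clips from one speaker land in the same split. The rows and the `split_by_speaker` helper are made up for illustration; only the `client_id` and `path` column names are taken from the dataset. Without `client_id`, this kind of grouping is not possible.

```python
import hashlib
from collections import defaultdict


def split_by_speaker(rows, test_fraction=0.2):
    """Assign every clip of a speaker to the same split, based on a hash of client_id."""
    splits = defaultdict(list)
    for row in rows:
        digest = hashlib.sha256(row["client_id"].encode("utf-8")).digest()
        # Map the first 4 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:4], "big") / 2**32
        name = "test" if bucket < test_fraction else "train"
        splits[name].append(row["path"])
    return splits


# Illustrative rows only; real rows would come from the dataset's TSV files.
rows = [
    {"client_id": "spk_a", "path": "a1.mp3"},
    {"client_id": "spk_a", "path": "a2.mp3"},
    {"client_id": "spk_b", "path": "b1.mp3"},
]
splits = split_by_speaker(rows)
```

Hashing the id rather than shuffling keeps the assignment stable across runs, so a speaker never drifts from train into test when new clips are added.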

My idea was more simplistic, just a phrase here and there… But that would not prevent mischief…

The data trust idea @kathyreid mentioned is very promising. Some people may allow their data to be used for TTS and a separate dataset can be created for this purpose. And/or, only people who contributed N recordings can load the data and they must be a member for this to happen. Etc etc…

I’m eager to hear more resolution ideas and expect this issue to be resolved in a rapid manner, wrt the Mozilla Common Voice Governance Framework v1.0

With GANs and deep fakes becoming common knowledge, I’m not sure how I can ask people to contribute as a (former?) language representative.

1 Like

Can we consider allowing users to choose the license for their data, so that those who do wish to contribute their recordings to the public domain can do so? As the LibriVox site says, “All our audio is in the public domain, so you may use it for whatever purpose you wish.” which I think is clearly part of the ethos of that project, and often discussed among that community. I think that a similar ethos also is first and foremost for many Common Voice contributors, though I haven’t seen as much discussion of it among the community.

I also think it’s important to differentiate between using a voice to develop a TTS model, and voice cloning. It sounds to me that the main worry is over voice cloning, where the result will be a model which sounds just like the original single person’s voice, and can be used for impersonation etc. The wider application of TTS however can use many voices together to synthesize voices that don’t sound like any single voice in the corpus. In fact for several Common Voice languages, this corpus may be the only or best option for developing a speech synthesizer. If the community wants to donate their voices for this purpose, it would be nice for them to have an option to do so.

2 Likes

Yes @cjbaker, I was referring to voice cloning. Having my voice mixed with many others to generate a generalized model will be similar to STT.

On the other hand, one can even argue against it (sorry for the analogy given below, take it as a mind exercise):

A company produces guns and bullets (e.g. VALL-E) from raw material (our voices) and makes these available on the streets for everyone without keeping records (permissive licenses without logs). It also teaches people how to assemble and use them (Hugging Face, GitHub, etc.). Also, the country’s laws allow this (our dead-slow legal systems, which cannot keep up with tech development). Some people take a gun, push the bullet in (a specific voice used to finetune), use it to do harm…

Whose fault? Is it just a “crime” case? Is it related to ethics? Should we require permits and keep records? Should we limit the production - even ban personal guns?

In my opinion, this is a classic Oppenheimer case. You cannot blame only the user here. And these are valid for the whole AI arena nowadays.

I understand some of the concern @bozden, and I think it’s certainly good to be cautious and to understand risks. At some point though, many people will decide to just trust their fellow humans, and to engage in endeavors with such huge potential benefit. If we’re talking about allowing someone to voluntarily donate their voice, especially if they understand the potential applications, then I see no comparison with gun violence.

I would suggest changing the title away from “forbidding TTS usage” if that’s not what we’re talking about. TTS is such a wide and important application: just to mention one use case, it allows blind people to use computers and to navigate and participate in the Internet.

I’m in favor of @kathyreid’s proposal for a permission model, to allow users individual choice over the use of their data, if there is truly a desire among users to move away from contributing to the public domain.

2 Likes

I would suggest changing the title away from “forbidding TTS usage” if that’s not what we’re talking about.

Thank you for this. I relaxed the title 🙂

if they understand the potential applications

I think one side of the problem we have here (or similar projects) is:

We only give the positive sides of it, how it will be useful, how good it is, why they should donate, how they are protected etc. All are part of “marketing”.

And people like you and me need the data (scientific, technological, or just hobby usage), and we support having large amounts of data. So we like the marketing side, we keep quiet and support it, do campaigns, etc…

Mozilla & Foundation, as a whole, are very important. They do lots of work around privacy, security, democracy, rights, and advocacy against digital monopolies etc.

Perhaps we might need to extend these also into this project, warning people beforehand how their data ALSO can be misused, how they can protect themselves against it, create multi-lingual videos about it, etc.

After that, people can decide if they would participate and what they will permit.

  • Is voice biometric? Yes.
  • Is biometric data protected by (most if not all) legislations? Yes.
  • Can biometric data be misused? Yes.
  • Should worldwide data collectors regard ALL these legislations (generally and per legislation)? Yes.
  • Should these data collectors protect the collected biometric data generally and individually/per person (with EULAs, technical measures and ethically)? Yes.
  • Does the data collector have responsibility if somebody misuses the data due to missing EULA/security etc on the collector’s side? Yes.
  • Should these (procedures, code, EULAs, contracts, etc) be regularly updated according to new data and laws? Yes.
  • Etc etc…

All must be “yes”, otherwise there sure will be a smoking gun.

I’m not sure why the team is picky about the use of GitHub issues for licensing-related issues, but the related feature request was closed with the following comments:

additional restrictions cannot be added to Creative Commons licenses

This is correct. There is no “restrictive cc license” like in GPL vs LGPL. I think people would benefit from such a “LCC” license, but this is not a problem of CV.

On the other hand, there is already a restriction put on CC0 by Common Voice, which becomes invalid under this comment/Mozilla Legal’s view, namely “determining the identities”, which would make the problem worse:

I did not try to suggest a way of doing it, but such a simple wording change was in my mind:

  • You agree to not attempt to determine the identity of speakers in the Common Voice dataset or use it to clone individual voices

That is assuming this is something that can be done on a CC0-licensed dataset. If this cannot be done, Common Voice licensing needs a change.

There have been quite a few talks and suggestions about CV languages and locales, such as:

  • Licensing of text-corpora can be relaxed to CC BY, as we already include license
  • Complete dataset license can be set to CC BY-SA, which would also force the dataset users to share it publicly
  • Licensing can be set language based and that should be determined by the community (by @ftyers, my favorite one as that would democratize languages from central decisions) - also in line with @kathyreid’s data trust proposal.

meaning this topic is better addressed as a future policy issue

I hope this means “immediate future” as the threat is imminent.

Hello! Sorry, that was my call to move this from an issue to a conversation, as I’m trying to keep (at least for my triage workflow) Issues limited to relatively narrowly scoped technically-focused discussions or decisions whereas this one, as you point out, could ask us as a project to reexamine licensing, which would involve a larger and more holistic reworking of the Common Voice project.

I know it’s a bother to move more involved discussions to the forum and I do appreciate your patience. As questions and discussion touching on licensing would impact non-technical users and contributors, I like to try and move them here where there’s not a barrier of a GitHub account and the text is more easily searched from the web.

2 Likes

I agree with @jesslynnrose, I think this is the best place to elicit a range of views from different stakeholders.

I think I’m being misunderstood. I have nothing against the forum discussion, it was I who opened this post 🙂

Community participation is a must and the path -which includes this discussion- is already well-defined in the Mozilla Common Voice Governance Doc V1.0, as I mentioned elsewhere.

I don’t want to argue; I’m OK with staff decisions, but I had a feature request, and it stands.

And I find it very important as it has security and privacy-related repercussions.

I’m also under (self) ethical pressure as I made hundreds of people join and even made them fill in their demographic info. I feel I need to inform the community (99% of them are AI illiterate) about the issue and provide methods for mitigation - before it is too late.

I hope this legit request gets resolved before then…

The only issue I see with such a forum discussion and the governance document is the possibility of conflict of interest. Dataset users who benefit from it might have a more liberal view, I’m afraid.

Another possibility to approach this “situation” could be:
Mozilla develops tools which recognize and mark/flag deep fakes (voice and film) and voice-only clones.
This would be on a bigger scale than CV - the keyword here is fake news around upcoming elections, created to sow confusion.

Actually, I’m not worried about a politician’s voice being replicated from CV datasets, they speak everywhere, but not on CV.

My concern is that the voices of our communities end up in things like talking sex-bots, animated porn, or upcoming generative model-based video content/movies (where they would need many, many voices).

1 Like