Need Common Voice admin help with a volunteer

Nemo, the top contributor for the Georgian language, has too many clips: 34K+. And she keeps adding new ones (in August she had 24K).

This is an issue because most of her clips aren’t used in STT training. If they were, the STT would be biased toward her type of voice. This isn’t hypothetical: I have talked about this with the developer of the Enagram platform, and he said that they didn’t use most of her recordings. And I don’t believe there is any situation where so many recordings of one person might be useful (correct me if I am wrong).

Nemo keeps on recording clips, and all this work turns out to be in vain. I tried to find out who she is so I could inform her about the issue, but didn’t manage to. So I want to ask the Common Voice admins: please contact her and inform her about the issue. If she still decides to continue recording clips, that’s her choice; I just want her to make an informed choice.

@Razmik-Badalyan, why not catch up with her performance using more diverse voices :slight_smile:

Do you mean for more volunteers to record 34K+ clips? Call me pessimistic but I don’t think that is doable when I look at the activity level of our community :sweat_smile:

At first, I was also thinking like you do, but my models say otherwise, as I explained before.

It might seem a waste of volunteer time for now, but please be aware that dataset-building projects like Common Voice have long time spans, measured in years.

Accuracy in machine learning models gets better with exponentially more data. Just to give some hypothetical numbers: You can drop WER from 50% to 40% with +100h of recordings, but to go from 20% to 10% you would need +1000h, much more if you need to go from 5% to 3%.

Georgian has ~4M native speakers, and you have 1254 different (?) voices, which is a good number (at least much better than our sample size for Turkish). You probably could not reach 1% of the population (40,000) with campaigns etc., so you will increase that gradually, but having those people record more will be more important. As I explained previously, more data is better.

One can easily use 5k recordings from Nemo in training. In a couple of years, many people from the community will also reach 10k+, so her recordings will be used more, and they will not be wasted (except bandwidth perhaps). And probably she will quit after some time, but her contribution will live…

We will never be able to get 1M different people recording diverse sentences (the ideal case). So we need 1,000-2,000 people to record thousands of sentences each, while trying to enlarge the voice diversity.

There’s also another issue to consider here. If Nemo has recorded 34,000 (approx) clips, and clips in Common Voice have a duration of around 5 seconds, then this will be around 47 hours of one person’s voice.
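The estimate above is simple arithmetic. A minimal sketch, using only the approximate figures quoted in this post (about 34,000 clips at about 5 seconds each, neither of which is a measured value):

```python
# Rough estimate of one contributor's total recorded time.
# Both inputs are the approximate figures from the post, not measured values.
clip_count = 34_000
avg_clip_seconds = 5

total_hours = clip_count * avg_clip_seconds / 3600
print(f"{total_hours:.1f} hours")
```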

This is enough to train a TTS / speech synthesis model, reasonably accurately (depending on the phonetic coverage of the Georgian written sentences). That is, there is enough data in the Georgian dataset for someone to synthesise Nemo’s voice, because the TTS models (like VALL-E) require less and less data over time.

This is an ethical issue because the unique client_id value in the dataset denotes “her” recordings, and allows them to be extracted. So there is a risk here that someone may create a synthetic model of her voice without her permission, and use it in ways that she is not comfortable with (there are many reasons that unscrupulous developers would want to train a synthetic woman’s voice, which I will not detail here).

There’s also a flipside benefit to this. Nemo may be recording so much data so that she can train a TTS model of her own voice, in which case we may not want to limit her efforts.

I am going to flag @jesslynnrose and @gina here as they may have additional means for reaching out to Nemo to make her aware of the situation.

Thank you @kathyreid, I didn’t want to re-mention the TTS issue. It is a serious problem and you know my views on this (and Mozilla is not acting on this).

I think, not only a single person, but everybody should be made aware of this.

I personally asked my (female) family members and friends not to record cleanly anymore. If they record, they should have background sounds, such as music and/or open TV, so that their recordings will become worthless for TTS.

I’ve done some napkin calculations here, and with these estimates 10 000 hours is around 8 100 000 clips.

Given this total number of clips (8M+), we don’t need 1M voices. That would be overkill and truly not realistic. But 20K or 30K voices seem quite doable. And with this number of voices, each volunteer would have to do just 300-400 clips to get 8M in total.
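As a quick check of these numbers, here is a small sketch that uses only the napkin estimate above (8 100 000 clips for 10 000 hours) and the two voice counts mentioned. The function name is just for illustration.

```python
# Clips each volunteer would need to record to reach a total clip target.
# total_clips is the napkin estimate for 10 000 hours, not an official figure.
total_clips = 8_100_000


def clips_per_voice(total: int, voices: int) -> float:
    """Average clips per volunteer if the work were split evenly."""
    return total / voices


for voices in (20_000, 30_000):
    print(voices, round(clips_per_voice(total_clips, voices)))
```

An even split gives roughly 270 to 405 clips per person, which is in the same ballpark as the 300-400 figure. In practice contributions are far from evenly split.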

My bet is on influencers. I need to convince 4 to 5 relatively big Georgian influencers to post about CV, and I believe we can pass the first 5K voices in a short time. I plan to couple this with a 100-day clip-recording marathon for Georgian language day (14 April 2024).

As far as I know, TTS training requires studio-quality clean recordings (correct me if I’m wrong). Nemo’s clips do not have such quality.

Thank you, I hope they will contact her. Let me be clear, I’m not saying she should stop. I want her to be informed of the issue.

If you don’t mind, some comments on this… Here is your v15.0 recs/voice (validated):

  • Less than 5% of people recorded more than 128 sentences.
  • Most of them just recorded 5 sentences and moved away (they tried!).
  • Your average is ~63 sentences, but that includes the top performers’ effect; if you leave them out, it will drop.

This will always be similar, check other languages. These few recordings are also valuable, because they usually go to test & dev splits. And the top performer voices go to the train split.
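For anyone who wants to reproduce this kind of recs/voice summary, here is a minimal sketch. It assumes you already have one `client_id` per validated clip (for example, read from the dataset's validated list); the function name and the toy data are made up for illustration.

```python
from collections import Counter
from statistics import mean, median


def recs_per_voice_summary(client_ids, threshold=128):
    """Summarise recordings per voice, given one client_id per validated clip."""
    per_voice = list(Counter(client_ids).values())
    heavy = sum(1 for n in per_voice if n > threshold)
    return {
        "voices": len(per_voice),
        "mean": mean(per_voice),
        "median": median(per_voice),
        "share_above_threshold": heavy / len(per_voice),
    }


# Toy data (invented): three casual voices and one heavy contributor.
toy = ["a"] * 5 + ["b"] * 5 + ["c"] * 10 + ["d"] * 200
summary = recs_per_voice_summary(toy)
print(summary)
```

Comparing the mean with the median shows how strongly a few top performers pull the average up.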

10 000 hours is

Don’t aim for the 10k; we have fine-tuning and transfer learning now, and 10k was an old figure. I’d say go stepwise: 100 - 200 - 500 - 1000 - 2000 hours. You might like to set yearly targets this way, because far-away targets scare people. You are past 100h, so select 200/250/300 for 2024, for example.

Your avg. rec duration is about 5 sec, so for 1000h (say end of 2025), you would need 650k new recordings, ~200k new sentences, 2-3k new voices (300 recordings on average each). Don’t forget, it will again be like a normal distribution…
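A sketch of that arithmetic, assuming about 110 hours are already recorded (an assumption for illustration; the post only says the dataset is past 100h) and 300 recordings per new voice on average:

```python
# Back-of-envelope planning for an hours target.
# current_hours=110 is an assumed starting point, not a figure from the dataset.
def new_recordings_needed(target_hours, current_hours, avg_clip_seconds):
    """Recordings still required to reach target_hours at the given clip length."""
    return (target_hours - current_hours) * 3600 / avg_clip_seconds


needed = new_recordings_needed(target_hours=1000, current_hours=110, avg_clip_seconds=5)
voices = needed / 300  # at 300 recordings per new voice on average
print(round(needed), round(voices))
```

With these inputs the result is about 640k new recordings and a little over 2k new voices, in line with the rough figures above.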

I’d advise longer sentences, though. Try to get some longer recordings; SotA models work best with 5-25 sec recordings.

My bet is on influencers.

We were not successful with this, but AI was not a thing at that time… I hope you’ll get better results.

Georgian language day

That was successful for us; try to start the campaign teasers 1 month before.

As far as I know, TTS training requires studio-quality clean recording

That was in the past. As @kathyreid emphasized, there is VALL-E now. This is a pre-trained base model and with even a small amount of data (3 seconds!!!), you can finetune it to other voices. High quality and longer will be better of course, but clean (no cracks, background sounds etc) CV recordings can easily be used for it.

Because of TTS-based scams, they removed high-quality models from public access, but they are there and people do replicate the science.

I don’t see that much of a problem. The contract already provided for the use of TTS. It is out of Mozilla’s control and limiting usage seems to devalue the dataset…

Hi @Razmik-Badalyan

Thank you so much for bringing this matter to our attention.

Kindly note that we are in the process of updating our documentation to explicitly communicate the risks associated with voice contributions. Once these updates are implemented, we will share them with the community.

We will reach out to the user to ensure she is informed about the potential risks her recordings may carry. While we cannot prevent contributors from continuing their contributions, our aim is to raise awareness about the potential risks involved. In some cases, as @kathyreid mentioned, contributors may want to train their own speech models using their own voice data, in which case we cannot intervene.

Thank you @bozden and @kathyreid for sharing your insights and responses. We highly value your input, and I want to assure you that we are actively working on addressing these issues internally.

Thank you for your suggestions :pray:

Thank you, Gina :pray:, looking forward to seeing the new guidelines.

Hi all. At Enagram, we are in the process of training a Speech-to-Text (STT) model using Georgian Common Voice data. Our data scientist has observed that out of the 110 hours of recordings, we have utilized only 41 hours. This selection was necessary to prevent bias towards a few dominant voices in our STT model. Consequently, we wish to inform our contributors that the availability of 110 hours of recordings does not imply that all of these hours are suitable for STT training. It is more beneficial to contribute recordings featuring a variety of voices rather than a predominance of a few voices, as the latter may not be used in the end due to concerns about bias.
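The thread does not say how the 41 hours were selected. One common way to limit dominant voices is to cap the total duration kept per speaker. The sketch below shows that idea only; it is not Enagram's actual pipeline, and the function name, the tuple layout and the cap value are all hypothetical.

```python
from collections import defaultdict


def cap_per_speaker(clips, max_seconds_per_speaker):
    """Keep clips in order until each speaker reaches the duration cap.

    clips: iterable of (client_id, clip_path, duration_seconds) tuples.
    """
    used = defaultdict(float)  # seconds already kept per client_id
    kept = []
    for client_id, path, seconds in clips:
        if used[client_id] + seconds <= max_seconds_per_speaker:
            used[client_id] += seconds
            kept.append((client_id, path, seconds))
    return kept


# Toy data (invented): one heavy contributor and one casual contributor.
toy = [
    ("heavy", "h1.mp3", 5.0),
    ("heavy", "h2.mp3", 5.0),
    ("heavy", "h3.mp3", 5.0),
    ("casual", "c1.mp3", 4.0),
]
kept = cap_per_speaker(toy, max_seconds_per_speaker=10.0)
print([path for _, path, _ in kept])
```

With a cap like this, every clip from occasional contributors is kept, while only part of a heavy contributor's recordings makes it into training.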

Thank you for your suggestions :pray:

Hi @Rati_Skhirtladze, nice to have you here, welcome.

If you don’t mind me asking: What model architecture are you training? What workflow do you use (finetuning, etc)?

Did you try creating a model which uses the whole dataset and compare the results with your current model?

But individual contributors/voices can only contribute recordings of their/our own voice. A few of the messages here (not just what I’m replying to) seem to suggest that it is possible to just dial down the contributions of one person and dial up the contributions of more different people, but… that’s not how volunteer‐based projects work.

If you have one person willing to submit 50 hours of recordings of their own voice, that doesn’t also mean you have 100 people willing to submit 30 minutes of their voices (or even 50 people willing to submit 10 minutes). It’s not one or the other. 🤷

IMHO, Razmik had the right approach in not wanting to ask this contributor to stop, but rather to inform them of how their continued contributions will be exponentially less useful, as well as how they might subject themselves to getting “synthesized” by scammers etc. – i.e., to make sure they continue contributing informedly, not to tell them off.

Oh, and also…

you.

As others have mentioned, they might be contributing all of this data because they want to use it for their own purposes. (I started contributing because I was(/am) planning to eventually train a speech recognition model on my own voice.)

Just because volunteers acting a certain way would be more beneficial to you and your use case, it doesn’t make it universally so. One of the beautiful things about free and open data sets is that we don’t know what use cases other people can think of and just because not all parts of the data set apply to what we can think of or what we might want to use it for presently it doesn’t mean that those should be discarded from the set as a whole… you never know what someone else might want/need, and maybe this exact data is what makes this data set perfect for their use case.

Hi Gina, could you please give me an update on this matter? On the leader-board, I can see that she still keeps on recording clips. Just wanna make sure that she is informed about potential risks.

Hi @Razmik-Badalyan, I apologize for the delay. Kindly note that I will provide you with feedback soon.
