Too many recordings?

The top contributor in my (Georgian) community has 24K+ recordings and shows no signs of slowing down: they add 100+ recordings daily.

Do they have too many recordings?

As far as I understand machine learning, so much data from one person can make the STT model biased. Is that so? If so, is there any way to address it?

One option that comes to mind is to mark each user's data in the dataset so that developers could exclude part of Nemo's recordings.

One more question: what is the recommended upper limit of recordings for one person? Common Voice's statement that 1000 volunteers recording 45 sentences daily would reach 10K hours in 6 months translates to 8100 recordings per person.

Note: The following opinions are my own, based on experimentation. They are not the official views of Common Voice and might even contradict them.

As far as I understand machine learning, so much data from one person can make the STT model biased. Is that so?

AFAIK, voice bias is the most important of all bias types, and it brings gender, age, and accent bias along with it: a single dominant voice skews all of those distributions at once.

Do they have too many recordings?

AFAIK, that depends on the model and the task at hand. E.g., Whisper's multilingual model is one of the more robust ones with respect to biases. If you fine-tune the multilingual model with somewhat biased data, it will not become as biased as a model trained from scratch on the same data. But you never know until you train it and test it in the real world.

If so, is there any way to address it?

Some points about this:

  • CV does not impose any limit on recordings per person, neither technically nor as a suggestion.
  • It is the splitting algorithm’s job to deal with multiple recordings per person and/or sentence. The current CV default splits (from the CV CorporaCreator repo) put one recording per sentence, and a voice appears in only a single split (train, dev, or test); all other recordings are disregarded, so many of this contributor’s recordings will not be included in training anyway (see the sketch after this list). But because demographic/voice diversity is also NOT taken into account in that algorithm, it limits the training and sometimes even makes it worse (due to dropped voice diversity).
  • Voices (~persons) are already identified to some extent by the client_id field, so there is no need to tag them with an additional column. But client_id is not an exact science: a single person can have multiple accounts and/or use the system without logging in from different devices, so a single person can end up with multiple client_id’s.
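
To make the mechanics concrete, here is a minimal sketch of an s1-style, speaker-disjoint split. This is NOT the actual CorporaCreator algorithm (which allocates dev/test sizes more carefully); client_id and sentence are real columns of a release’s validated.tsv, but the 80/10/10 allocation is my simplification:

```python
import pandas as pd

# Load the validated recordings of a Common Voice release.
df = pd.read_csv("validated.tsv", sep="\t")

# "s1"-style: keep only one recording per distinct sentence;
# all other duplicates of that sentence are dropped.
df_s1 = df.drop_duplicates(subset="sentence").copy()

# Keep each voice (client_id) in exactly one split, so no speaker
# leaks from train into dev/test.
speakers = df_s1["client_id"].drop_duplicates().sample(frac=1.0, random_state=42)
n = len(speakers)
split_of = {
    s: "train" if i < 0.8 * n else ("dev" if i < 0.9 * n else "test")
    for i, s in enumerate(speakers)
}
df_s1["split"] = df_s1["client_id"].map(split_of)

for name, part in df_s1.groupby("split"):
    part.to_csv(f"{name}.tsv", sep="\t", index=False)
```

Note that nothing here looks at gender, age, or accent, which is exactly the missing diversity handling mentioned above.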

go up to 10K hours

That is an amount specified years ago, which might be what is required to train from scratch (i.e. starting from random model parameters). But we now have transfer learning, and given an English model (whose parameters are already somewhat set for a Western language), you can reach the same point with much less data (e.g. 1-2K hours of recordings, depending on the language and model).

Also, if you are fine-tuning a multilingual model (e.g. Whisper) with CV data, 100 hours can be enough to get somewhat good results.
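
To make “fine-tuning a multilingual model with CV data” concrete, here is a condensed sketch of the usual Hugging Face recipe for Whisper Tiny on Common Voice Turkish. The dataset/model names are examples (the CV dataset is gated on the hub, so you must accept its terms first), and evaluation, fp16, etc. are omitted; this is not official CV tooling, just how I would wire it up:

```python
from datasets import load_dataset, Audio
from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

model_name = "openai/whisper-tiny"
processor = WhisperProcessor.from_pretrained(
    model_name, language="turkish", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Common Voice from the Hugging Face hub; resample to Whisper's 16 kHz.
cv = load_dataset("mozilla-foundation/common_voice_14_0", "tr")
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    # Log-Mel input features from the waveform, token ids from the text.
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = cv.map(prepare, remove_columns=cv["train"].column_names)

def collate(features):
    # Stack the fixed-size input features; pad the labels and mask the
    # padding with -100 so it is ignored by the loss.
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features],
        return_tensors="pt")
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt")
    batch["labels"] = labels["input_ids"].masked_fill(
        labels["attention_mask"].ne(1), -100)
    return batch

args = Seq2SeqTrainingArguments(output_dir="whisper-tiny-tr",
                                per_device_train_batch_size=16,
                                learning_rate=1e-5, max_steps=4000)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=cv["train"], eval_dataset=cv["test"],
                         data_collator=collate,
                         tokenizer=processor.feature_extractor)
trainer.train()
```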

With large-scale models trained on many hundreds of thousands of hours, like Whisper (>600K hours), having all those recordings, even from a single dominant person, outweighs the biasing effect (I’ve been experimenting on this; some graphs are shared at the bottom). Think of this scenario for gender bias, for example:

A de-biased dataset with much less data gives these WER results:
Overall: 30%
Male: 28%
Female: 32%

A more biased dataset with much more training data gives these WER results:
Overall: 20%
Male: 15%
Female: 25%

Which one is better?

Currently, SotA models require “more data”, whatever it contains, as their robustness has improved. I’m not saying bias is not important (it is VERY important), but the constraint can be relaxed somewhat with SotA models, when fine-tuning for example.
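
If you want to see where your own model falls in such a trade-off, per-group WER is straightforward to measure. A minimal sketch using the jiwer library; the table and its column names are made up here, you would fill it from your test-set transcriptions:

```python
import pandas as pd
import jiwer

# Hypothetical evaluation table: one row per test clip.
results = pd.DataFrame({
    "reference":  ["merhaba dünya", "bugün hava güzel"],
    "hypothesis": ["merhaba dünya", "bugün hav güzel"],
    "gender":     ["male", "female"],
})

# Overall WER, then WER broken down by reported gender.
print("Overall:", jiwer.wer(results["reference"].tolist(),
                            results["hypothesis"].tolist()))
for gender, part in results.groupby("gender"):
    print(gender, jiwer.wer(part["reference"].tolist(),
                            part["hypothesis"].tolist()))
```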

What can be done?

  • You might build a community, promote it through some social media channels, and direct contributors toward a better dataset (“do not record more than N for now”, “more female voices are needed”, etc.).
  • You can counteract such an individual by balancing, e.g. if a male is recording too much, get many women to record more.
  • Gradually increase the number of distinct voices through campaigns.

In v8.0 we (Turkish) had the same problem: a single male with a heavy accent made the results worse than v7.0, so we counteracted it with the methods above for the sake of later releases. We also devised a more robust splitting algorithm.

Obviously, 1000 people with 1000 recordings each is better than 1000 people with 100 recordings each. And obviously, again, if a dataset is dominated by the voices of a few people, the trained model will fail in the wild.

Also, beware: the top list only includes people who gave permission to be shown, so there may be others. You can only see the results for released datasets, e.g. through the Common Voice Analyzer (check the Voices tab).

Some experiment results

Below: Whisper Tiny multilingual, fine-tuned on CV v14.0 Turkish split with different algorithms:
s1 (black): CorporaCreator default splits (a single recording per sentence)
s99 (blue): CorporaCreator with up to 99 recordings per sentence => much more data
v1 (red): an alternative algorithm that uses all validated recordings

[image]

The same for Uzbek (uz) is given below. The x-axis is training steps: more training data means more steps.

[image]

Below: Coqui STT (DeepSpeech) results for 3 different languages, using CV v10.0. From left to right (increasing training data): the s1, s99, and v1 algorithms.


An addendum about your calculation of “how much time or how many recordings are needed per day”.

For simplicity, suppose the following:

  • The average recording duration in your corpus is 3.6 seconds, so 1 hour of recordings means 1000 recordings.
  • It is good to have multiple recordings per sentence, from diverse genders/ages/accents. Say you aim for 3 recordings per sentence on average.
  • Your language and model require 1000 hours to give good results for your application.
  • Your community can produce 1 hour of VALIDATED recordings (i.e. 1000 recordings) per day on average.

As a result, to reach your 1000-hour goal (see the small script below for the same arithmetic):

  • You would need ~333k different sentences.
  • You need 1000 days (2.74 years) to reach your goal.
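
The same arithmetic as a tiny script, so you can plug in your own language’s averages (all numbers are just the assumptions above):

```python
avg_clip_sec = 3.6        # average recording duration in the corpus
recs_per_sentence = 3     # target recordings per sentence
goal_hours = 1000         # hours needed for your model/application
validated_per_day = 1000  # validated recordings produced daily

recs_per_hour = 3600 / avg_clip_sec          # 1000 recordings per hour
total_recs = goal_hours * recs_per_hour      # 1,000,000 recordings
sentences = total_recs / recs_per_sentence   # ~333,333 sentences
days = total_recs / validated_per_day        # 1000 days

print(f"~{sentences:,.0f} sentences, {days:,.0f} days "
      f"({days / 365:.2f} years)")
```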

I’m assuming you use ALL validated recordings. If you limit that, it will take longer.

In my experience, 1 hour of validated recordings per day can only be reached with a large contributor base (e.g. English), or with constant effort from community leads who can run many campaigns; that was required in our case, with ~1000 diverse voices.


Thanks for the response @bozden. Your comments are always thorough.
