Note: The following opinions are my own, based on experimentation. They are not the official views of Common Voice and may even conflict with them.
> As far as I understand how machine learning works, so much data from one person can make the STT biased. Is that so?
AFAIK, voice bias is the most important among all kinds of bias, as it also introduces gender, age, and accent bias.
> Do they have too many recordings?
AFAIK, that depends on the model and the task at hand. E.g., Whisper's multilingual model is one of the more robust ones with respect to biases. If you fine-tune the multilingual model with somewhat biased data, it will not become too biased, compared to training from scratch. But you never know until you train it and test it in the real world.
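One practical way to "test it in the real world" is to slice your evaluation set by demographics and compute per-group WER. A minimal sketch using the jiwer library (the `samples` list is hypothetical; you would fill it with decoded utterances from your own test run):

```python
# Measure bias by computing WER separately per demographic group.
from collections import defaultdict
import jiwer  # pip install jiwer

# Hypothetical (reference, hypothesis, group) triples from a decoded test set.
samples = [
    ("merhaba dünya", "merhaba dünya", "male"),
    ("bugün hava güzel", "bugün hata güzel", "female"),
    # ... more decoded test samples
]

by_group = defaultdict(lambda: ([], []))
for ref, hyp, group in samples:
    by_group[group][0].append(ref)
    by_group[group][1].append(hyp)

for group, (refs, hyps) in by_group.items():
    print(f"{group}: WER = {jiwer.wer(refs, hyps):.2%}")
```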
> If so, is there any way to address this topic?
Some points about this:
- CV does not impose any limitations on recordings per person, technically or as a suggestion.
- It is the splitting algorithm’s job to deal with multiple recordings per person and/or sentence. The current CV default splits (from the CV CorporaCreator repo) keep one recording per sentence, a voice only exists in a single split (train, dev, or test), and all other recordings are disregarded. So many of that person’s recordings will not be included in the training. But because demographic/voice diversity is also NOT taken into account by that algorithm, it limits the training and sometimes even makes results worse (due to dropped voice diversity). A toy sketch of this logic is given after this list.
- Voices (~persons) are already identified by the client_id field to some extent, so there is no need to tag them with an additional column. But client_id is not an exact science: a single person can have multiple accounts and/or use the system from different devices while not logged in, so a single person can end up with multiple client_id’s.
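For illustration, here is a toy re-implementation of that default-style behavior (keep one recording per sentence, keep each voice in a single split). The hash-based speaker assignment and the field names are my simplification of the validated.tsv format, not CorporaCreator's actual code:

```python
# Toy sketch: one recording per sentence, each voice in exactly one split.
import hashlib

def split_of(client_id: str) -> str:
    # Deterministically map a voice to train/dev/test (~80/10/10).
    h = int(hashlib.sha1(client_id.encode()).hexdigest(), 16) % 10
    return "train" if h < 8 else ("dev" if h == 8 else "test")

def make_splits(rows):
    # rows: iterable of dicts with at least "client_id" and "sentence".
    seen_sentences = set()
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        if row["sentence"] in seen_sentences:
            continue  # every later recording of this sentence is dropped
        seen_sentences.add(row["sentence"])
        splits[split_of(row["client_id"])].append(row)
    return splits
```

Note how everything after the first recording of a sentence is simply dropped; that is exactly where voice diversity can silently disappear.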
> go up to 10K hours
That is an amount specified years ago, which might be required to train from scratch (i.e., starting with random model parameters). But we now have transfer learning: given an English model (which already has somewhat settled parameters for a Western language), you can reach the same point with much less data (e.g., 1-2k hours of recordings, depending on the language and model).
Also, if you are fine-tuning a multilingual model (e.g., Whisper) with CV data, 100 hours can be enough to get reasonably good results.
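For reference, here is a heavily condensed sketch of such a fine-tuning setup with the Hugging Face libraries. The model and dataset IDs are the public hub ones (downloading the CV dataset from the hub requires accepting its terms), and the actual trainer part is omitted:

```python
# Condensed sketch: prepare CV Turkish for fine-tuning multilingual Whisper.
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="turkish", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

cv = load_dataset("mozilla-foundation/common_voice_14_0", "tr", split="train")
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = cv.map(prepare, remove_columns=cv.column_names)
# ...then train with Seq2SeqTrainer, as in the Hugging Face examples.
```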
With large-scale models trained on many hundreds of thousands of hours, like Whisper (>600k hours), having all those recordings, even from a single dominant person, outweighs the biasing effect (I’ve been experimenting on this; some graphs are shared at the bottom). Think of this scenario for gender bias, for example:
A de-biased dataset with much less data gives these WER results:

- Overall: 30%
- Male: 28%
- Female: 32%

A more biased dataset with much more training data gives these WER results:

- Overall: 20%
- Male: 15%
- Female: 25%
Which one is better?
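To make the trade-off explicit, here is the plain arithmetic on those numbers:

```python
# Gender-gap arithmetic for the two scenarios above (pure illustration).
results = {
    "de-biased, less data": {"male": 0.28, "female": 0.32},
    "biased, more data":    {"male": 0.15, "female": 0.25},
}
for name, wer in results.items():
    gap = wer["female"] - wer["male"]
    worst = max(wer.values())
    print(f"{name}: gender gap = {gap:.0%}, worst group = {worst:.0%}")
# de-biased, less data: gender gap = 4%, worst group = 32%
# biased, more data:    gender gap = 10%, worst group = 25%
```

The larger, more biased dataset widens the gender gap, yet every single group still ends up with a lower error rate.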
Currently, SotA models just require “more data”, whatever it contains, as their robustness has become better. I’m not saying bias is not important (it is VERY important), but it can become a more relaxed constraint with SotA models, e.g., when fine-tuning.
What can be done?
- You might build a community, promote it on some social media channels, and steer the community toward a better dataset (“do not record more than N for now”, “more female voices are needed”, etc.).
- You can counteract such individuals by balancing, e.g., if a man is recording too much, get many women to record more (a capping sketch is given after this list).
- Gradually increase the number of distinct voices through campaigns.
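As a sketch of the balancing idea (second item above), you could cap how many recordings any single voice contributes before training. Column names follow the CV validated.tsv format; the cap value is an arbitrary assumption to tune per language and model:

```python
# Cap per-voice contributions to reduce single-speaker dominance.
import pandas as pd

MAX_PER_VOICE = 200  # assumption: tune per language/model

df = pd.read_csv("validated.tsv", sep="\t")
capped = (
    df.groupby("client_id", group_keys=False)
      .apply(lambda g: g.sample(min(len(g), MAX_PER_VOICE), random_state=42))
)
print(capped["gender"].value_counts(normalize=True))  # check balance afterwards
```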
In v8.0 we (Turkish) had the same problem: a single male with a heavy accent made the results worse than v7.0, so we counteracted that with the above methods for the sake of further releases. We also devised a more robust splitting algorithm.
Obviously, 1000 people with 1000 recordings each are better than 1000 people with 100 recordings each. And, just as obviously, if a few voices dominate a dataset, the trained model will fail in the wild.
Also, beware: the top list only includes people who gave permission to be listed; there might be others. You can only see the results from released datasets, e.g., through the Common Voice Analyzer (check the Voices tab).
Some experiment results
Below: Whisper Tiny multilingual, fine-tuned with CV v14.0 Turkish using different splitting algorithms:
- s1 (black): CorporaCreator default splits (single recording per sentence)
- s99 (blue): CorporaCreator with max 99 recordings per sentence => much more data
- v1 (red): an alternative algorithm that uses all validated recordings
The same for Uzbek (uz) is given below. The x-axis is training steps; more training data means more steps.
Below: Coqui STT (DeepSpeech) results for 3 different languages, using CV v10.0. From left to right (increasing training data): the s1, s99, and v1 algorithms.