I want to know whether the number of voices represents the number of speakers, and what the total number of voices in Common Voice 11.0 is. When I select different languages I can see the number of voices for each language, but is there a total across all languages? I would appreciate it very much if anyone could give me an answer.
The number depends on what you want to achieve.
According to the metadata, v11.0 reports a total of 271,817 users (the sum over all languages).
I can give more detailed information if I know where it will be used.
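For illustration, the "sum of languages" can be sketched as below. This is only a sketch: it assumes the release metadata has already been loaded into a dict keyed by language code, with a per-language "users" count. The function name, the dict layout, and the toy values are hypothetical, not the real metadata.

```python
def total_reported_users(locales):
    """Sum the per-language 'users' counts.

    Assumption: `locales` maps a language code to a dict holding a 'users'
    count. A contributor active in several languages is counted once per
    language, so this is an upper bound, not a count of distinct people.
    """
    return sum(stats["users"] for stats in locales.values())


# Made-up toy values for illustration, NOT real Common Voice figures.
example_locales = {
    "aa": {"users": 120},
    "bb": {"users": 45},
    "cc": {"users": 7},
}

print(total_reported_users(example_locales))  # prints 172
```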
Thanks! That helps me a lot.
But beware: for various reasons, that number is not necessarily the exact number of distinct people. It can be taken as an estimate, though.
Could you give us a rough estimate of the number of speakers? I will use it for research purposes. Also, would a user be considered a different user if they recorded audio through the same browser but at different times?
No, I don’t think anybody can, because of the privacy rules in place.
The limiting factors are listed below, but first some info: the system identifies users by the field “client_id”. AFAIK, that value is created from the user’s browser session; I don’t know the exact mechanism, though.
I mean, how would it behave in different situations? Will it be different in different windows (for browser X, but what about Y?), different tabs, different profiles on the same device/browser, mobile device browsers (apps), with privacy-related plugins, VPNs, after deletion of cookies, etc.? The number of combinations can be very large.
- People who have a profile and record while logged in have a unique client_id for sure.
- If a person has a profile, gets logged out, and keeps recording, he/she will have the same client_id.
- People can record without logging in, and they will still get a client_id. Unknown: what happens if the browser is closed and reopened, cache and cookies are cleaned, a new profile is opened, another browser is used, etc.?
- People can record without logging in on different devices, e.g. desktop and mobile; they would have different client_ids.
- A person can create multiple accounts using different emails and record through each of them; they will be seen as different users.
So there is no way to get an exact number, or even a reliable estimate.
Our best estimate is the number I’ve given above.
A better estimate can be reached if more people register and keep recording while logged in, across multiple devices. Uniqueness, on the other hand, can only be guaranteed by very strict rules, such as those used in legal systems through a legal ID or such, which is out of the question of course.
Two additional notes about the number given above:
- These are all voices, not only voices in validated recordings. We know there are some visitors who come and record some “spam-like” sentences, and those get invalidated.
- A person (client_id) can exist in more than one language if he/she contributes to more than one language. The number I have given is not a deduplicated value; it is just the sum of the numbers in the metadata.
To get rid of these effects, one should join the validated buckets of all language datasets and dedup them on client_id. I’d estimate a 10% drop, resulting in e.g. 250k.
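The join-and-dedup step could be sketched roughly like this. It is a minimal sketch, assuming every language has been extracted to its own folder under one root directory and that each folder contains a tab-separated validated.tsv with a client_id column. The function name and the directory layout are my assumptions; adjust them to how you unpacked the data.

```python
import csv
from pathlib import Path


def count_unique_voices(corpus_root):
    """Count distinct client_id values across every <lang>/validated.tsv.

    Assumed layout (hypothetical): <corpus_root>/<lang>/validated.tsv
    Returns (per-language distinct counts, distinct count over all languages).
    """
    root = Path(corpus_root)
    per_language = {}
    all_ids = set()
    for tsv_path in sorted(root.glob("*/validated.tsv")):
        with tsv_path.open(newline="", encoding="utf-8") as handle:
            # QUOTE_NONE: sentences may contain stray quote characters.
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            ids = {row["client_id"] for row in reader}
        per_language[tsv_path.parent.name] = len(ids)
        # Set union removes client_ids that appear in more than one language.
        all_ids |= ids
    return per_language, len(all_ids)
```

Calling count_unique_voices("path/to/extracted/corpus") gives the per-language counts and the deduplicated total. The difference between the sum of the per-language counts and that total is the number of cross-language duplicates. Keep in mind that this still counts client_ids, not people, so all the caveats above apply.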