Building a training data-set of kids’ voices

We are building an educational platform for economically disadvantaged kids aged 4 - 6 and are planning on incorporating Common Voice into it in order to help kids improve their English reading skills.

Initially, we plan to build a game where the child reads out individual words of a story, and our game gives feedback on whether or not the child has pronounced them correctly.

Before we do that, we obviously need a training data-set of kids’ voices.

We have the ability to collect kids’ voices, but before we start doing that, I’d love some advice from the community on exactly what to collect. Specifically:

  1. How many unique kids’ voices should we aim to collect? I know more is better, but since we have limited budgets, what’s a realistic number of individual kids’ voices we are looking at in order to have a reliable training data-set?

  2. For each kid, how many recorded words should we collect on average?

  3. How many unique words do we need recordings for in the whole training data-set? Is it better to have a large number of unique words (and therefore fewer sample recordings per word), or fewer unique words (and therefore more sample recordings per word)? What combination of number of unique words and number of sample recordings per unique word should we aim for?

If you need more information from me to be able to answer the above questions, please let me know.



I’m really interested in getting the Beaver Scout group I work with to record some sentences. I want to build this into a segment looking at how technology can be used to help people. Did you make any progress identifying useful words for us to work with?
We are looking at 30 children aged 4 to 8.

Sorry that nobody replied to the original message here in 2018.

We are not allowed to collect children’s voices. This is a legal limitation: you must be 19 or older in order to contribute your voice to Common Voice.

Thanks for your understanding.

Taken from the Legal Terms for Common Voice:

If you are 19 or under, you must have your parent or guardian’s consent and they must supervise your participation in Common Voice.

Doesn’t this allow collection of children’s voices, as long as the parents/guardians consent and supervise? (Though realistically, constant supervision might be hard for a parent to keep up.)

Currently we don’t have a formalized process to gather these consents, and probably we won’t have bandwidth if we are talking about hundreds of consents.

@nukeador you are correct that we don’t currently have a way of gathering consent from the parents and are not collecting voices from people under 19.

While this is a very interesting project, building a consent mechanism and collecting children’s voices is not something that is currently on the roadmap for 2020.


I see that it is not easy to design a suitable process that integrates contributions from people under 19 while staying compliant with the legal terms (see also “Bad words list for your languages”), but I don’t see “children’s voices” as such a specific or different realm if our goal is a high-quality dataset that includes the “diversities” of the commonly spoken language.

People under 19 are people just as adults are… and excluding their contributions will build a biased dataset. That’s bad, imho.


Do you actually need to collect consent from the parents/guardians of those under 19? IIRC, services usually just include it in the terms of service as a plug-and-play clause.

Our legal team advised us that we need to, and we don’t want to take any risks. The reality is that right now we don’t have the bandwidth to do so.

@r_LsdZVv67VKuK6fuHZ_tFpg I don’t know if maybe we should reflect this on the site to avoid false expectations.

@nukeador let’s discuss this in our next meeting to find the best path forward and where to reflect this for contributors.

With the goal of teaching machines how real people speak, my prayer :pray::pray::pray: is to find any possible way to enable contributions from people under 19 (and possibly from anyone).

@r_LsdZVv67VKuK6fuHZ_tFpg that’s not clear to me, because in the dataset stats report on the CV website I read that a percentage of contributors are under 19!

So do you mean that contributions from people under 19 (the 6% “< 19”, see below) are discarded in the CV backend? I hope not!


23% United States English
9% England English

21% 19 - 29
15% 30 - 39
8% 40 - 49
5% < 19
4% 50 - 59
3% 60 - 69
1% 70 - 79

47% Male
11% Female



32% 19 - 29
19% 50 - 59
11% 30 - 39
10% 40 - 49
6% < 19

62% Male (Maschio)
18% Female (Femmina)

Warning :warning::warning:
More generally, let me point out again my real concern: placing many restrictions on recordings could introduce a large bias in the dataset. By the way, the low percentage of female contributions will produce a gender bias :roll_eyes:.

Please always take my comments as positive and proactive. I care deeply about the goal of open data and open-source language tech.


@heyhillary Hello Hillary,
What is the latest status for this issue?

We are going to the university to collect voices from students; not allowing anyone under 20 means missing a huge portion of voices.

I checked the Abkhazian dataset: voices from people 19 and under are collected and are part of the dataset.

If the contributor is under 19, the data is collected under the teens category; if the contributor is 19, the data goes under the twenties category.

The legal terms of Common Voice state that anyone 19 or under needs approval from a parent or guardian, but on the website 19-year-olds are placed in the twenties category. So part of the twenties category consists of 19-year-olds who need approval from their parent/guardian; in this sense the data is polluted.
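To make the overlap concrete, here is a minimal sketch of the two rules as described in this thread. The function names, the constant, and the label strings are assumptions made up for illustration; this is not Common Voice’s actual code, and buckets above the twenties are not modelled.

```python
# Illustrative only: names and labels are hypothetical, not Common Voice code.
CONSENT_MAX_AGE = 19  # legal terms quoted in this thread: "19 or under" need consent


def age_bucket(age: int) -> str:
    """Return the age category as described in this thread."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 19:
        return "teens"
    if age <= 29:
        return "twenties"
    return "thirties_or_older"  # placeholder; older buckets are not modelled here


def needs_guardian_consent(age: int) -> bool:
    """True if the quoted legal terms require parent/guardian consent."""
    return age <= CONSENT_MAX_AGE


# Print bucket and consent requirement around the boundary.
for age in (18, 19, 20):
    print(age, age_bucket(age), needs_guardian_consent(age))
```

For age 19 the sketch returns the twenties label while still flagging consent as required, which is exactly the overlap in question.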

We need to get a clear policy on this. @heyhillary

Good catch! But I think “teens” is “<19” and “twenties” is “19-29”. I don’t think it is pollution.

As far as I can see, the legal decision came after CV had started to collect data. As the data is not fine-grained enough, there was no way to correct the database (add 19-year-olds to teens or whatever).

And different parts of the world have different laws regarding who counts as an adult.

E.g. Germany: at 18 you can do what you want (theoretically :rofl:).

Children and juveniles under 18 are protected by law.
If you have permission from the parents (both!) you are on the safe side.
If one parent withdraws the permission, the administrative challenge begins (to put it that way…)

By pollution I mean that you have 19-year-old kids in the twenties category. There’s no clear policy on how to handle those 19 and under, so that category is polluted with 19-year-olds who might or might not have permission from their parents, and it is not clear how Common Voice should handle that.

I understand that the legal terms of Common Voice state that those 19 and under need approval.

:smiley: Congratulations :smiley: this is called data management or data adjustment. :smiley:

If a court or a person gets the impression that biometric voice recordings of minors are being saved, released, and processed without proof of permission… oh my, good night.

In the worst case (without proof of permission) this would mean a full wipe of those underage clips (and not just moving the clips to the graveyard!)

No offense, no harm, just my concerns!

A solution could be:

  • CV legal terms based on the country from which you are contributing (and for what language). They state exactly what age is required in that country to contribute as an adult without permission. CV has no interest in underage clips.
    [But how to deal with VPN/Tor/anonymous contributors???]

  • Building up CV servers for every continent/country, collecting the data under the legal terms of that continent/country, and releasing that language to the public from there (much more expensive).
    The language community (for this country) decides whether underage clips are collected and how to deal with permissions.
    For contributing as a minor: 2 permissions from different emails for the underage account??? (if CV is interested in underage clips)

Also, I guess a common practice today is to use the adult’s account to record the child as well.
So much for the “pollution”: misleadingly labelled voice clips included.

I also doubt that a fully grown-up/adult 16-year-old contributor in the United Kingdom asks their parents for permission to contribute to CV, or that the parents supervise them, just because CV states over 19 in its terms. And here comes the best part:
They do not have to do so in the UK (UK law)!

CV must find a way to respect the different laws of different countries/continents, especially on a project of this scale. The age declared by CV is not valid for the rest of the world.

What are the future plans of CV/Mozilla for dealing with this “situation”? Talk to us!

Millions of saved words, but not cc0 :weary:

Thanks for pointing this out!!

In a validation session I first heard a child talking in the background while an adult was reading and recording, then five sentences from just the child, and afterwards the adult again.

Also, the age of majority ranges from 15 to 21 years worldwide. Exception: Iran (female 9 / male 15)!?

Is the idea of switching to legal terms and/or servers by country (based on the country you are contributing to CV from) too far out???