Building a training dataset of kids’ voices

We are building an educational platform for economically disadvantaged kids aged 4 - 6 and are planning on incorporating Common Voice into it in order to help kids improve their English reading skills.

Initially, we plan to build a game where the child reads out individual words of a story, and our game gives feedback on whether or not the child has pronounced them correctly.

Before we do that, we obviously need a training dataset of kids’ voices.

We have the ability to collect kids’ voices, but before we start doing that, I’d love some advice from the community on exactly what to collect. Specifically:

  1. How many unique kids’ voices should we aim to collect? I know more is better, but since we have limited budgets, what’s a realistic number of individual kids’ voices we are looking at in order to have a reliable training data-set?

  2. For each kid, how many recorded words should we collect on average?

  3. How many unique words do we need recordings for in the whole training dataset? Is it better to have a large number of unique words (and therefore fewer sample recordings per word), or fewer unique words (and therefore more sample recordings per word)? What combination of number of unique words and number of sample recordings per unique word should we aim for?
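
To make the trade-off in question 3 concrete, here is a minimal sketch of the arithmetic involved, assuming every child records the same number of words and that recordings are spread evenly across the vocabulary. The `CollectionPlan` class and all the numbers are hypothetical placeholders for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class CollectionPlan:
    """Hypothetical container for one recording plan (illustration only)."""
    num_kids: int        # unique child speakers
    words_per_kid: int   # recordings collected from each child
    unique_words: int    # size of the vocabulary to cover

    @property
    def total_recordings(self) -> int:
        # Overall recording budget implied by the plan.
        return self.num_kids * self.words_per_kid

    @property
    def recordings_per_word(self) -> float:
        # Average only: assumes recordings are spread evenly over the words.
        return self.total_recordings / self.unique_words


# Same budget (placeholder numbers), two different vocabulary sizes.
plans = [
    CollectionPlan(num_kids=100, words_per_kid=50, unique_words=200),
    CollectionPlan(num_kids=100, words_per_kid=50, unique_words=1000),
]

for plan in plans:
    print(
        f"{plan.unique_words} unique words -> "
        f"{plan.total_recordings} recordings, "
        f"{plan.recordings_per_word:.1f} per word on average"
    )
```

With a fixed budget, growing the vocabulary shrinks the average number of recordings per word proportionally, which is the tension the question is asking about.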

If you need more information from me to be able to answer the above questions, please let me know.



I’m really interested in getting the Beaver Scout group I work with to record some sentences. I want to build this into a segment looking at how technology can be used to help people. Did you make any progress identifying useful words for us to work with?
We are looking at 30 children aged 4-8 years.

Sorry nobody replied to the original message here in 2018.

We are not allowed to collect children’s voices. This is a legal limitation: you must be 19 or older in order to contribute your voice to Common Voice.

Thanks for your understanding.

Taken from the Legal Terms for Common Voice:

If you are 19 or under, you must have your parent or guardian’s consent and they must supervise your participation in Common Voice.

Doesn’t this allow collection of children’s voices, as long as the parents/guardians consent and supervise? (Though realistically, constant supervision might be hard for a parent to keep up.)

Currently we don’t have a formalized process to gather these consents, and we probably won’t have the bandwidth if we are talking about hundreds of consents.

@lsaunders ?

@nukeador you are correct that we don’t currently have a way of gathering consent from the parents and are not collecting voices from people under 19.

While this is a very interesting project, building a consent mechanism and collecting children’s voices is not something that is currently on the roadmap for 2020.


I see that it is not easy to sketch a suitable process that integrates contributions from people under 19 while complying with the legal terms (see also Bad words list for your languages), but I don’t see “children’s voices” as such a specific or different realm if our goal is to achieve high quality, including “diversity”, in the definition of a common spoken “language model”.

People under 19 are people just like adults… and excluding their contributions will build a biased dataset. That’s bad, imo.


Do you actually need to collect consent from the parents/guardians of those under 19? IIRC, services usually just include it in the terms of service as a plug-and-play clause.

Our legal team asked us to collect consent if we need to, and we don’t want to take any risks. The reality is that right now we don’t have the bandwidth to do so.

@lsaunders I don’t know if maybe we should reflect this on the site to avoid false expectations.


@nukeador let’s discuss this in our next meeting to find the best path forward and where to reflect this for contributors.

With the goal of teaching machines how real people speak, my prayer :pray::pray::pray: is to find any possible way to enable contributions from people under 19 (and possibly from anyone).

@lsaunders that’s not clear to me, because in the dataset stats report on the CV website I read that there is a percentage of contributors under 19!

So do you mean that contributions from people under 19 (6% < 19, see below) are discarded in the CV backend? I hope not!


23% United States English
9% England English

21% 19 - 29
15% 30 - 39
8% 40 - 49
5% < 19
4% 50 - 59
3% 60 - 69
1% 70 - 79

47% Male
11% Female



32% 19 - 29
19% 50 - 59
11% 30 - 39
10% 40 - 49
6% < 19

62% Maschio (Male)
18% Femmina (Female)

Warning :warning::warning:
More generally, let me point out again my real concern about a possible big dataset bias when recordings come with many restrictions. BTW, the low percentage of female contributions will produce a gender bias :roll_eyes:.

Please always take my comments as positive and proactive. I really love the goal of open data and open source language tech.
