Building a training dataset of kids’ voices

Currently we don’t have a formalized process to gather these consents, and probably we won’t have bandwidth if we are talking about hundreds of consents.

@r_LsdZVv67VKuK6fuHZ_tFpg ?

@nukeador you are correct that we don’t currently have a way of gathering consent from the parents and are not collecting voices from people under 19.

While this is a very interesting project, building a consent mechanism and collecting children’s voices is not something that is currently on the roadmap for 2020.

Hi,

I see that it is not easy to sketch a suitable process that integrates contributions from people under 19 while staying compliant with the legal terms (see also Bad words words list for your languages ), but I don’t see “children’s voices” as such a specific or different realm if our goal is to achieve high quality, including “diversity” in the definition of a common spoken “language model”.

People under 19 are people, just like adults… and excluding their contributions will build a biased dataset. That’s bad, imo.

Do you actually need to collect consent from the parents/guardians of those under 19? iirc services usually just include it in the terms of service as a plug-and-play

Our legal team told us that we need to, and we don’t want to take any risks. The reality is that right now we don’t have the bandwidth to do so.

@r_LsdZVv67VKuK6fuHZ_tFpg I don’t know if maybe we should reflect this on the site to avoid false expectations.

@nukeador let’s discuss this in our next meeting to find the best path forward and where to reflect this for contributors.

With the goal of teaching machines how real people speak, my prayer :pray::pray::pray: is to find any possible way to enable contributions from people under 19 (and possibly from anyone).

@r_LsdZVv67VKuK6fuHZ_tFpg that’s not clear to me, because in the dataset stats report on the CV website I read that there is a percentage of contributors under 19!

So do you mean that under-19 contributions (6% < 19, see below) are discarded in the CV backend? I hope not!


from: https://voice.mozilla.org/en/datasets

Accent
23% United States English
9% England English

Age
21% 19 - 29
15% 30 - 39
8% 40 - 49
5% < 19
4% 50 - 59
3% 60 - 69
1% 70 - 79

Sex
47% Male
11% Female

from: https://voice.mozilla.org/it/datasets

Accento

Età
32% 19 - 29
19% 50 - 59
11% 30 - 39
10% 40 - 49
6% < 19

Sesso
62% Maschio
18% Femmina


Warning :warning::warning:
More generally, let me point out again my real concern about a possibly large dataset bias when recordings are collected under many restrictions. By the way, the low percentage of female contributions will produce a gender bias :roll_eyes:.

Please always take my comments as positive and proactive. I love the goal of open data and open-source language tech too much.
:full_moon_with_face::earth_africa::earth_americas::earth_asia::facepunch:

@heyhillary Hello Hillary,
What is the latest status for this issue?

We are going to the university to collect voices from students; not allowing those under 20 means missing a huge portion of voices.

I checked the Abkhazian dataset: voices of people 19 and under are collected and are part of the dataset.

If the contributor is under 19, the data is collected under the teens category; if the contributor is 19, the data goes under the twenties category.

The legal terms of Common Voice state that those 19 or under need approval from their parents or guardians, but on the website 19-year-olds are collected in the twenties category. So part of the twenties category consists of 19-year-olds - those who need approval from their parent/guardian. In this sense the data is polluted.
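
To make the boundary issue concrete, here is a minimal Python sketch. It assumes the buckets are "< 19" (teens) and "19 - 29" (twenties) as shown in the dataset stats, and that consent is needed for "19 or under" as described in this post. The function names and the `CONSENT_AGE_LIMIT` constant are made up for illustration and are not taken from the Common Voice code base.

```python
# Assumption from this thread: guardian approval is needed for "19 or under".
CONSENT_AGE_LIMIT = 19


def age_bucket(age: int) -> str:
    """Map an age to a stats bucket (labels are illustrative)."""
    if age < 19:
        return "teens"      # shown as "< 19" in the stats
    if age <= 29:
        return "twenties"   # shown as "19 - 29" in the stats
    return "thirties_or_older"


def needs_guardian_consent(age: int) -> bool:
    """True if the age falls under the assumed consent rule."""
    return age <= CONSENT_AGE_LIMIT


def ambiguous_ages(max_age: int = 100) -> list:
    """Ages that need consent but do not land in the 'teens' bucket."""
    return [
        a for a in range(max_age + 1)
        if needs_guardian_consent(a) and age_bucket(a) != "teens"
    ]
```

Under these assumptions `ambiguous_ages()` contains only 19, which is exactly the age that ends up in the twenties category while still needing approval.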

We need to get a clear policy on this. @heyhillary

Good catch! But I think “teens” is “<19” and “twenties” is “19-29”. I don’t think it is pollution.

As far as I can see, the legal decision came out after CV had started to collect data. As the data is not fine-grained enough, there was no way to correct the database (add 19-year-olds to teens or whatever).

And in different parts of the world there are different laws regarding who counts as an adult.

E.g. Germany: at 18 you can do what you want (theoretically :rofl:).

Children and juveniles under 18 are protected by law.
If you have the permission from the parents (both!) you are on the safe side.
If one parent withdraws the permission, the administrative challenge begins (to put it that way…)

What I mean by pollution: you have 19-year-old kids in the twenties category, and there’s no clear policy on how to handle those 19 and under. That category is polluted with 19-year-olds - those who might or might not have permission from their parents - and it is not clear how to handle this on Common Voice.

As I understand it, the legal terms of Common Voice state that those 19 and under need approval.

:smiley: Congratulations :smiley: this is called data management or data adjustment. :smiley:

If a court/person gets the impression: underage and biometric voice recordings, and saving/releasing/processing them without proof of permission… oh my, good night.

In the worst case (without proof of permission) this would mean a full wipe of those underage clips (and not moving the clips to graveyard!)

No offense, no harm, just my concerns!

  • A solution could be: CV legal terms based on the country from which you are contributing (and for which language). They state exactly what age is required (in that country) to contribute as an adult without permission. CV has no interest in underage clips.
    [But how to deal with VPN/Tor/anonymous contributors???]

  • Building up CV servers for every continent/country, collecting the data under the legal terms of that continent/country, and releasing that language to the public from there (much more expensive).
    The language community (for this country) decides whether underage clips are collected and how to deal with permissions.
    For contributing while underage: 2 permissions from different emails for the underage account??? - if CV is interested in underage clips.
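
As a rough illustration of the first idea, here is a small Python sketch of a per-country age lookup. Everything in it is hypothetical: the table holds only the one value mentioned in this thread (Germany, 18), the default of 19 is a placeholder for the age in the current CV terms (the thread is not consistent on whether 19 itself is included), and none of the names come from Common Voice. It is not legal advice.

```python
from typing import Optional

# Placeholder for the age stated in the current CV terms (assumption).
DEFAULT_MIN_AGE = 19

# Hypothetical table; only Germany's value (18) is mentioned in this thread.
MIN_AGE_BY_COUNTRY = {"DE": 18}


def min_unsupervised_age(country_code: Optional[str]) -> int:
    """Return the age from which no guardian permission is assumed necessary.

    Unknown or missing country codes (e.g. VPN/Tor users) fall back to the default.
    """
    if country_code is None:
        return DEFAULT_MIN_AGE
    return MIN_AGE_BY_COUNTRY.get(country_code.upper(), DEFAULT_MIN_AGE)


def may_contribute_without_permission(age: int, country_code: Optional[str]) -> bool:
    """True if the contributor meets the assumed minimum age for the country."""
    return age >= min_unsupervised_age(country_code)
```

With this table an 18-year-old with country code "DE" would pass, while the same contributor with no detectable country would be held to the default. That is one possible answer to the VPN/Tor question, not the only one.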

Also, I guess a common (current) practice is using the adult’s account to record the child as well.
So much for the “pollution” - misleading labelling of voice clips included.

I also doubt that a fully grown-up/adult 16-year-old contributor from the United Kingdom asks their parents for permission to contribute to CV, and that the parents supervise them, just because CV states over 19 in its terms. And here comes the best part:
They do not have to do so in the UK (UK law)!

CV must find a way to respect the differing laws of different countries/continents, especially for a project on this scale. The age declared by CV is not valid for the rest of the world.

What are the future plans of CV/Mozilla for dealing with this “situation”? Talk to us!

Millions of saved words, but not cc0 :weary:

Thanks for pointing this out!!

In a validation session I first heard a child talking in the background while an adult was reading and recording, then 5 sentences from just the child, and afterwards the adult again.

Also, the age of majority is in the range of 15 to 21 years worldwide. Exception: Iran (female 9/male 15)!?

Is the idea of switching to legal terms and/or servers by country (the country from which you are contributing to CV) too far out???

These solutions are a good start, but they are too expensive and time-consuming; this would require a dedicated team just to handle this issue.

I prefer a solution that is simple, fast, effective, and low-maintenance. There is a solution, we just don’t see it yet.

Oh wow, very impressive. Killer phrases at their best. Are these proposals too diverse???

And that is a growing part of the “discussion culture” here on Discourse and also sometimes on GitHub:
Oh no, low resources…
Oh no, low bandwidth…
Oh, I do not know what the guys from (insert the current sponsor here)…
Oh no, this is not possible, because this means major changes…
But hey, thanks for the “discussion”.

It can be so easy in zero and one land… especially with so many contributors involved… from different parts of the world.
This works until courts create facts and not fiction.
FB also boldly declared in their terms whatever was easy for them, but did not mention some other things. But hey, FB is free!!! One law student and a European court had another opinion on this, resulting in paying big bucks.
“Digital Rights” should ring a bell (and I do not mean the rights of FB).

Think of the children :innocent:
and
Where is the money???
Pathetic

Also, when someone (CV/Moz) starts a project on this (worldwide) scale, then CV/Moz can also handle it on a worldwide basis. The cheapest way is not the best way to handle this on a worldwide scale. It takes no rocket science to figure that out. Playing sitting duck and pretending everything is fine will NOT solve this; it means missing many (possible) recordings and missing the chance to find ways to use children’s clips legally on an international basis. Common Voice is too important to fall for that.

General thoughts on getting kid/underage voices:

The basic idea is:
Starting a pilot project in selected school(s).
(Locally in the school, online via the internet to a local CV server, or both ways???)
(With official funding by the state/government???, funding by the Moz foundation???)

I learned from some reports that there are reading contests in the U.S. (reading bee, if I remember correctly).
Not the contest itself, but the preparation time for these reading contests would be interesting for CV.
CC0 sentences are presented to the underage contributor from a local CV server in the school, and the recordings are saved on a local CV server in the school.
CV/Moz creates a form for the parents to fill out, giving permission for their kids to contribute to CV.
If this pilot project is officially funded by the state/government, these filled-out permissions stay in the school (with the supervising teacher or the school office).
If it is not officially funded, they are saved (encrypted) on the local CV server (proof of permission for CV). Two adult email accounts (parents) are required to activate the account for the underage contributor (child/teen contributing online at home).
In a later step a decision is required:
Release this as a kids-only database or include the underage clips in the CV main corpus.
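
The “two adult email accounts” step could look roughly like the Python sketch below. This is only an illustration of the idea from this post: the `MinorAccount` class and its fields are invented here, and the sketch does not check that the addresses really belong to adults or to the child’s parents, which would need a separate verification step.

```python
from dataclasses import dataclass, field


@dataclass
class MinorAccount:
    """Hypothetical account that stays inactive until enough guardians confirm."""
    username: str
    required_confirmations: int = 2
    confirmed_emails: set = field(default_factory=set)

    def confirm(self, guardian_email: str) -> None:
        # Normalise so the same address cannot be counted twice.
        self.confirmed_emails.add(guardian_email.strip().lower())

    @property
    def active(self) -> bool:
        # Active only once enough distinct addresses have confirmed.
        return len(self.confirmed_emails) >= self.required_confirmations
```

A second confirmation from the same address (even with different capitalisation) does not activate the account; a confirmation from a second, different address does.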

As stated in some posts in this thread before:
Kids’ voices are also natural. Kids also use voice assistants, with (currently) mixed results. Avoiding (possible) problems is not the solution!

It is also very clear to me that the effort to start this pilot project within CV, with state/government involved, would be MASSIVE!

This proposal is directed at the “legal team” and/or the Moz foundation.

This could be a way to save and release underage clips LEGALLY!

Hello everybody,

Any advances on this? Or any official position by Mozilla Legal?

Thanks!