Our current data is what’s called a “read speech corpus”, because each sentence is read from a prompt. It can also be very useful to have a “spontaneous speech corpus”. In this case, the speech is produced spontaneously, and the transcription is created later on by listening to it.
I have long thought that the creation of a spontaneous corpus would also be an ideal application for crowd-sourcing, though I'm not sure whether it could one day fall within the scope of this project. You would initially contribute a recording of your voice, and other users would then transcribe and verify it. Your Alexa voice history might be a good source for that, or simply recordings of your side of (phone) conversations, etc.
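The record-then-transcribe-then-verify loop described above could be modeled with a very small data structure. This is only a sketch of one possible design; the class name, fields, and the two-verification threshold are all my own assumptions, not anything this project has specified:

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class SpontaneousClip:
    """One crowd-sourced recording awaiting transcription and verification.

    Hypothetical schema: a contributor uploads audio, another user adds a
    transcript, and further users verify it before it enters the corpus.
    """
    audio_path: str
    contributor: str
    transcript: Optional[str] = None              # filled in later by another user
    verified_by: List[str] = field(default_factory=list)

    def is_ready(self, min_verifications: int = 2) -> bool:
        """A clip joins the corpus once transcribed and independently verified."""
        return self.transcript is not None and len(self.verified_by) >= min_verifications

# Example lifecycle of one clip (all values invented for illustration):
clip = SpontaneousClip("clips/0001.wav", contributor="alice")
clip.transcript = "well I was just saying that um"   # added by a second user
clip.verified_by += ["bob", "carol"]                 # confirmed by two others
```

The point of separating `transcript` from `verified_by` is that, unlike a read corpus, spontaneous speech has no prompt to check against, so the transcription itself needs independent review before it can be trusted.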