I understand and share some of these worries, but not others. I’ll try to dig into each of them separately.
legal (US) stuff
Making CommonVoice a more enjoyable thing (for both adults and children) does not make it illegal by itself.
There is even a 4-19 age category in the profile, and kids’ voices account for ~4% in multiple languages ¹ so far. There are many compelling arguments in terms of education, language learning, … It’s up to Mozilla US to clear up that legal stuff once and for all.
The project does not currently forbid minors explicitly, and if that were to happen, the next step would be a porn-like advisory popup before users record/publish their own voice.
Aim of conversational recordings
Common Voice aims to provide conversational recordings - recordings of sentences, that can and do occur in everyday speech.
I’m sorry, but I have to disagree. The FAQ states (and I totally agree with this):
The goal of the Common Voice dataset is to enable anyone in the world to build speech recognition, speaker recognition, or any other type of application that requires voice data. A voice assistant is just one of many types of applications you could use the dataset to build.
The dataset could help to automatically transcribe theater recordings, radio or TV archives, interviews and conferences, and basically any human voice content recorded since the invention of the phonograph. It could help with human-machine interfaces (and not only conversations but also programming… or gaming). Advances in transcription could in turn bring advances in alignment, leading to broader (aligned) datasets and subsequent improvements in voice synthesis…
By the way, voices are currently collected by having users read written text (with imperfect punctuation and no context), which is already quite distinct from conversational recordings.
required sentence and recording length:
Sentences from books and articles are longer and more elaborate, and we all agree that having short sentences is (currently) necessary for machine learning.
But:
- How many of these unusable sentences exist in a large text? I guessed 5% or 10% and you rightfully disagree. But I can’t argue for or prove any particular number. Some testing must be done to see how (in)adequate books or articles are.
- But it’s perfectly possible to forcibly cut long sentences in order to get shorter clips: think of a cursor highlighting the region to read, which could very well stop at the next comma, colon, … and wait for the user to send or confirm the clip.
- The percentage of usable sentences within a book will only increase over time. For example, advances in DSAlign may very well allow longer sentences in the future. Having the clips already recorded won’t be bad at all. Still, it would be good to know from @Tilman_Kamp whether any insurmountable hard limit is already known.
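To make the “cut at the next comma” idea a bit more concrete, here is a minimal sketch of how a long sentence could be broken into shorter reading chunks. Everything in it is an assumption of mine: the name `split_into_chunks`, the `MAX_WORDS` limit, and the use of a word count as a rough stand-in for clip duration. It is not how Common Voice currently processes sentences.

```python
import re

# Hypothetical limit: the real constraint is clip duration, for which a
# word count is only a rough stand-in.
MAX_WORDS = 14


def split_into_chunks(text, max_words=MAX_WORDS):
    """Split text into chunks of at most max_words words where punctuation allows."""
    chunks = []
    # First split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if len(sentence.split()) <= max_words:
            chunks.append(sentence)
            continue
        # Too long: cut after commas, semicolons and colons, then greedily
        # regroup adjacent pieces while they still fit under the limit.
        current = ""
        for piece in re.split(r"(?<=[,;:])\s+", sentence):
            candidate = f"{current} {piece}".strip()
            if current and len(candidate.split()) > max_words:
                chunks.append(current)
                current = piece
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = ("Short one. This is a much longer sentence, with several clauses, "
              "that keeps going on and on; it really should be cut.")
    for chunk in split_into_chunks(sample, max_words=6):
        print(chunk)
```

Note that a stretch with no internal punctuation stays in one piece even if it exceeds the limit, so this only measures how far punctuation-based cutting gets us; it could also be used to estimate the share of unusable sentences in a given book.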
ordered sentences create significant bias on all parties involved
I failed to understand that point. Did you mean that a reviewer would tend to read the article/story as well instead of concentrating on the actual clip quality and validation? If yes, I’m not sure about that, but even if true, clips originating from books could be randomized before being submitted to the review queue.
“real world” recording
About your second worry of “real world” recording (I kept it for last because I think it’s a concerning one): real-world data is desired for robustness of the training dataset. Robustness is welcome but comes at the price of many more recorded hours.
- It’s likely that book or article reading may keep users talking longer than a typical sentence-reading session (which is good for the project), but I doubt recording conditions would be that different for these users.
- I don’t share the assumption that these clips would be “too perfect”. But I believe they would be more natural than the current context-orphaned sentences.
- But even if they were (clearer, better pronunciation, less noise, …) it would not harm the training dataset at all.
- Absence of noise is not a blocker. I remember having read research papers successfully adding artificial background noise to strengthen their final STT. I’m pretty sure it could be done as a post-recording (or pretraining) filter. It’s easier to add than remove.
[Next commonvoice campaign: Record only your background noise]
- Last but not least, it’d be easy to log whether a sentence comes from single-sentence mode or book mode, so it’d be possible to balance the training dataset to provide either more “good quality” recordings or more “noisy” recordings and actually determine what an ideal balance looks like.
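As an illustration of the “easier to add than remove” point, here is a rough sketch of mixing a background-noise recording into a clean clip at a chosen signal-to-noise ratio. It works on plain lists of float samples and uses only the standard library; the function names `rms` and `mix_noise` are mine, and this is not taken from any particular paper or from the DeepSpeech training code.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_noise(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR (dB)."""
    noise = list(noise)
    if len(noise) < len(speech):
        # Loop the noise if it is shorter than the clip.
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0 or noise_rms == 0:
        return list(speech)  # nothing to scale against
    # SNR(dB) = 20 * log10(speech_rms / scaled_noise_rms)
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    random.seed(0)
    # Stand-ins for real audio: a 440 Hz tone as "speech", white noise as background.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    background = [random.uniform(-1, 1) for _ in range(4000)]
    noisy = mix_noise(clean, background, snr_db=10)
    print(len(noisy), "samples mixed at 10 dB SNR")
```

With something like this, a “too clean” book-mode clip could be turned into several noisier training variants, whereas the reverse operation is much harder.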
Overall, bringing new audiences able to give hundreds if not thousands more hours of voice is an infinitely superior benefit.
Major blocker
There is a major blocker that has not been mentioned: the cost of adapting the existing platform (UI, database structure, …).
If that’s deemed too hard, I think the concept could be advantageously implemented as a browser extension.
When activated (in reader-view), it could start selecting regions of text, sentence after sentence, and provide the usual “record/stop/send” icons. Once recorded, continue and highlight the next adjacent sentence.
One issue with not using the commonvoice website would be the inability to control the license of the content users read and upload. But this could be mitigated by hardcoding an allowlist of domains known to provide CC-0 content (or by accepting pages that declare CC-0 in their HTML).
¹ IMHO a single 4-19 age category is a mistake. Splitting it into two categories, before and after 14, would be more useful for voice classification, machine-wise.
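The license check could look roughly like the sketch below (written in Python for brevity; a real extension would do the same in JavaScript). The domain list is a placeholder and the function names are mine. The only things I am relying on are the CC0 deed URL and the common `rel="license"` convention for marking a page’s license.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist: placeholder domains, not a vetted list.
CC0_DOMAINS = {"example.org", "cc0-texts.example"}

# URL of the CC0 1.0 public domain dedication.
CC0_URL = "creativecommons.org/publicdomain/zero/1.0"


def domain_allowed(url, allowed=CC0_DOMAINS):
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


class _LicenseFinder(HTMLParser):
    """Looks for <a> or <link> tags with rel="license" pointing at CC0."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        href = (attr.get("href") or "").lower()
        if "license" in rel and CC0_URL in href:
            self.found = True


def page_declares_cc0(html):
    """Heuristic only: a page can claim CC0 without this being true."""
    finder = _LicenseFinder()
    finder.feed(html)
    return finder.found


def may_record(url, html):
    """Allow recording if the domain is allowlisted or the page declares CC0."""
    return domain_allowed(url) or page_declares_cc0(html)
```

The HTML check is the weaker of the two, since anyone can put a license link on a page, so the allowlist would probably have to remain the primary filter.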