Common Voice improvement ideas

Piotrc · July 27, 2021, 3:30pm

A few things to improve the Common Voice user experience and data quality. I don’t know where to send these ideas, so let me post them here:

Background noise is very useful for speech analysis.
Could we occasionally (e.g. every 30 utterances) ask for a 10-second recording of just the background noise?
When recording, the 10-second limit seems too short for longer prompts. How about a rule where the limit varies between 10 and 15 seconds, depending on the length of the text? E.g.:

Text with 5 words => limit is 10s
Text with 12 words => limit is 12s
Text with 15+ words => limit is 15s
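
A minimal sketch of how such a rule could be computed, assuming the three examples above are treated as anchor points with linear interpolation in between. The function name `recording_limit_seconds` and the interpolation itself are illustrative assumptions, not something Common Voice implements:

```python
def recording_limit_seconds(word_count: int) -> float:
    """Return a hypothetical recording limit (seconds) for a prompt of word_count words."""
    # Anchor points taken from the examples above: (words, seconds).
    points = [(5, 10.0), (12, 12.0), (15, 15.0)]

    # Clamp to 10 s for short prompts and to 15 s for long ones.
    if word_count <= points[0][0]:
        return points[0][1]
    if word_count >= points[-1][0]:
        return points[-1][1]

    # Linear interpolation between the two surrounding anchor points.
    for (w0, s0), (w1, s1) in zip(points, points[1:]):
        if w0 <= word_count <= w1:
            return s0 + (s1 - s0) * (word_count - w0) / (w1 - w0)
    raise AssertionError("unreachable: word_count is always inside the anchor range here")
```

Any other monotonic mapping would serve the same purpose; the point is only that the limit grows with the prompt length and stays within 10-15 seconds.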

When recording, showing the current sound power level as a meter or graph (in dB) would be very helpful for checking whether the microphone has appropriate sensitivity.
During listening, sometimes the recording is just silence.
I’ve noticed two potential reasons:

The recording really is just silence. Then we could have a detector for that, so there is no need to annotate it. If you need help, I can provide a command-line tool that detects most silent recordings.
The web browser didn’t cache the audio properly, but it still lets me press “Play”. This seems to be a sign of a bug, so please double-check. It happens more often when I have a slow connection on an Android mobile phone.
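
A rough sketch of what a silence detector for the first case might look like. It assumes the clip has already been decoded to 16-bit signed PCM samples; the -50 dBFS threshold, the frame size, and the function names are illustrative assumptions that would need tuning on real recordings:

```python
import math

FULL_SCALE = 32768  # assumption: 16-bit signed PCM samples


def rms_dbfs(samples):
    """RMS level of a block of samples, in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / FULL_SCALE ** 2)


def is_probably_silent(samples, threshold_dbfs=-50.0, frame_size=480):
    """Return True if no frame of the clip is louder than threshold_dbfs."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return all(rms_dbfs(frame) <= threshold_dbfs for frame in frames)
```

Checking frame by frame rather than over the whole clip keeps a short utterance inside a long quiet recording from being flagged as silence.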

Showing the sound power graph (in dB) while listening would make it easier to judge whether a recording is too quiet, or whether the initial non-speech part is too long and the actual speech is delayed.

Showing a graph of the audio power while recording would help avoid a fairly common problem: people start talking a bit too early, and the first word is truncated.
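
The level graph suggested in the last two points could be driven by per-frame levels like the ones computed below. This is only a sketch under stated assumptions: 16-bit PCM input, a 16 kHz sample rate, 20 ms frames, and a -40 dBFS speech threshold are all illustrative choices, and the function names are hypothetical:

```python
import math


def frame_levels_dbfs(samples, sample_rate=16000, frame_ms=20, full_scale=32768):
    """Return one RMS level (dBFS) per frame; these values could feed a level graph."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    levels = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / len(frame)
        if mean_square == 0:
            levels.append(float("-inf"))
        else:
            levels.append(10.0 * math.log10(mean_square / full_scale ** 2))
    return levels


def leading_silence_seconds(levels, frame_ms=20, threshold_dbfs=-40.0):
    """Time before the first frame that exceeds threshold_dbfs."""
    for index, level in enumerate(levels):
        if level > threshold_dbfs:
            return index * frame_ms / 1000.0
    # No frame crossed the threshold: the whole clip counts as leading silence.
    return len(levels) * frame_ms / 1000.0
```

A long leading silence points to delayed speech, while a value of zero may hint that the speaker started before the recording did.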

Regards,
Piotr

Codigo_Logo_Programacao_e_Inteligencia_Artificial · July 27, 2021, 8:44pm

I think the background noise can be added in the training phase, as part of the data augmentation process. Maybe the audio should also be normalized to improve validation.
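
As a rough illustration of both ideas, the sketch below mixes a noise clip into speech at a chosen signal-to-noise ratio and applies simple peak normalization. It assumes floating-point samples in the range -1 to 1; the function names, the SNR-based mixing, and the 0.9 target peak are illustrative assumptions, not a description of any existing training pipeline:

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise RMS ratio equals snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise if it is shorter than the speech, then cut it to length.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * repeats)[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0 or noise_rms == 0:
        return list(speech)
    gain = speech_rms / (10 ** (snr_db / 20.0)) / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)
    return [s * target_peak / peak for s in samples]
```

For example, `mix_at_snr(speech, noise, 10)` would produce a noisy copy in which the speech is 10 dB louder than the added noise.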