Common Voice improvement ideas

A few things that could improve the Common Voice user experience and data quality. I don’t know where to send these ideas, so let me post them here:

  1. Background noise is very useful for speech analysis.
    Could we occasionally (e.g. every 30 utterances) ask for a 10-second recording of just the background noise?

  2. When recording, the 10-second limit seems too low for longer prompts. How about a rule where the limit varies between 10 and 15 seconds depending on the length of the text? E.g.:

  • Text with 5 words => limit is 10s
  • Text with 12 words => limit is 12s
  • Text with 15+ words => limit is 15s
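
To make the rule concrete, here is a minimal sketch of how the limit could be derived from the prompt. It assumes the three examples above are anchor points and interpolates linearly between them; the function name, the anchor table, and the interpolation itself are my own assumptions, not anything Common Voice currently does.

```python
# Hypothetical anchor points taken from the examples above:
# (word count, recording limit in seconds).
ANCHORS = [(5, 10.0), (12, 12.0), (15, 15.0)]


def recording_limit_seconds(text: str) -> float:
    """Return a recording limit between 10 s and 15 s based on word count."""
    words = len(text.split())
    # Clamp below the first anchor and above the last one.
    if words <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if words >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    # Interpolate linearly between the two neighbouring anchors.
    for (w0, s0), (w1, s1) in zip(ANCHORS, ANCHORS[1:]):
        if w0 <= words <= w1:
            return s0 + (s1 - s0) * (words - w0) / (w1 - w0)
    raise AssertionError("unreachable: word count is inside the anchor range")
```

Counting whitespace-separated words is only a rough proxy; character or syllable counts might work better for some languages.
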

  3. When recording, showing the actual sound power level as a graph (in dB) would be very helpful for checking whether the microphone has appropriate sensitivity.

  4. During listening, sometimes the recording is just silence.
    I’ve noticed two potential reasons:

  • The clip really is silence only. Then we could have a detector for that, with no need for anyone to annotate it. If you need help, I can provide a command-line tool that detects most silent recordings.
  • The web browser didn’t cache the audio properly, but it still lets me press “Play”. This looks like a bug, so please double-check. It happens more often when I have a slow connection on an Android phone.
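
For the first case, a silence check can be very small. The sketch below is not the command-line tool mentioned above, just an illustration of the idea: it computes the RMS level of a clip in dBFS and flags the clip when it falls below a threshold. It assumes the audio has already been decoded to floats in [-1.0, 1.0]; the function names and the -50 dBFS default are assumptions that would need tuning on real clips.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale, for samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    # 10*log10 of the mean square equals 20*log10 of the RMS amplitude.
    return 10.0 * math.log10(mean_square)


def is_silent(samples: Sequence[float], threshold_dbfs: float = -50.0) -> bool:
    """True if the whole clip is quieter than the (assumed) threshold."""
    return rms_dbfs(samples) < threshold_dbfs
```

A whole-clip RMS will miss recordings that are mostly silent with one short click, so a real detector would probably look at short frames instead.
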

Showing the sound power graph (in dB) while listening would make it easier to judge whether a clip is too quiet, or whether the leading non-speech part is too long and the actual speech is delayed.

Showing a graph of the audio power while recording would avoid a fairly common problem: people start talking a bit too soon, and the first word is truncated.
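
The data behind such a graph is just a level per short frame. As a rough sketch (assuming decoded samples as floats in [-1.0, 1.0]; the 20 ms frame size, the -40 dBFS speech threshold, and all names are illustrative assumptions), the same curve can also be used to estimate how long the leading non-speech part is:

```python
import math
from typing import List, Sequence


def frame_levels_dbfs(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 20.0,
                      floor_dbfs: float = -100.0) -> List[float]:
    """Level in dBFS for each consecutive frame; this is the curve to plot."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    levels = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / len(frame)
        level = 10.0 * math.log10(mean_square) if mean_square > 0.0 else floor_dbfs
        # Clamp so that digital silence still gives a plottable value.
        levels.append(max(level, floor_dbfs))
    return levels


def leading_silence_seconds(samples: Sequence[float], sample_rate: int,
                            threshold_dbfs: float = -40.0,
                            frame_ms: float = 20.0) -> float:
    """Time before the first frame that reaches the threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    levels = frame_levels_dbfs(samples, sample_rate, frame_ms)
    for index, level in enumerate(levels):
        if level >= threshold_dbfs:
            return index * frame_len / sample_rate
    # No frame reached the threshold: the whole clip counts as silence.
    return len(samples) / sample_rate
```

A leading silence close to zero could hint that the speaker started before the recording did, which is the truncated-first-word case.
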



I think that background noise can be added during training, as part of the data augmentation process. Maybe the audio should also be normalized to make validation easier.
