From my understanding, Common Voice is supposed to be ideal readings of known text (not ideal in audio fidelity, but in the accuracy, and enunciation?, of the words read).
With that in mind, is there any chance we could get the “sub-par” data as a separate set in the future?
The benefit, to me, is help with edge cases of how people actually say words: either to generate purposely “flawed” TTS, or to improve real-world STT (where people may say a word “wrong” a lot).