I would like to make a distinction here between words that are not spoken in the corpus, and phonemes that are under-represented in the corpus.
Different ASR systems use different approaches - Common Voice is intended for ASR, even though it is bastardised into other speech technologies - such as character recognition followed by word prediction (so something like a CTC beam search decoder with a language model applied, as DeepSpeech does).
What I have not seen in corpus analysis of Common Voice is:
- A character analysis of which characters and character combinations representing phonemes - `gh`, `th`, `ck`, `ch`, etc. - appear in the dataset.
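As a minimal sketch of that character-level analysis (the digraph list and sample sentences here are illustrative, not drawn from Common Voice), counting digraph frequencies over a sentence list could look like:

```python
from collections import Counter

# Digraphs that often map to a single phoneme in English (illustrative subset).
DIGRAPHS = ["gh", "th", "ck", "ch"]

def digraph_counts(sentences):
    """Count occurrences of each digraph across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for dg in DIGRAPHS:
            counts[dg] += text.count(dg)
    return counts

# Hypothetical sample standing in for corpus sentences.
sample = ["The check arrived through the night.", "Thick fog, rough chalk."]
print(digraph_counts(sample))
```

A real analysis would run this over the full validated sentence set and normalise by total character count before comparing against a reference corpus.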
- A phonetic analysis of phoneme frequency in the dataset.
For example, I suspect, but have not proven, that Common Voice over-represents the `th` sound - the /θ/ phoneme - and under-represents phonemes like /dʒ/ - the `j` sound in `jog` or `jumper`.
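To test a suspicion like that, one could sketch a rough phoneme count. The grapheme-to-phoneme mapping below is a deliberate toy (it ignores voiced /ð/ as in "this", and spellings like `dge` for /dʒ/); a real analysis would use a proper G2P tool or pronunciation lexicon:

```python
from collections import Counter

# Toy grapheme -> phoneme hints, for illustration only. A real analysis
# would use a G2P tool (e.g. espeak-ng) or a pronunciation dictionary.
GRAPHEME_PHONEMES = {
    "th": "/θ/",   # simplification: also matches voiced /ð/ as in "the"
    "j": "/dʒ/",   # simplification: ignores "dge", "g" before "e"/"i"
    "ch": "/tʃ/",
}

def rough_phoneme_counts(sentences):
    """Very rough phoneme frequency estimate from spelling alone."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for grapheme, phoneme in GRAPHEME_PHONEMES.items():
            counts[phoneme] += text.count(grapheme)
    return counts

sample = ["The thin judge chose a jumper."]
print(rough_phoneme_counts(sample))
```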
If we are considering creating sentences that include new words - which will have an impact at the language model or word error rate level - then we should also analyse the character and phonetic distribution of the corpus, to ensure that we have good coverage at that level as well.
I am thinking here of the Harvard Sentences.
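Once character or phoneme counts exist, "good coverage" could be checked by comparing the corpus distribution against a target distribution. The target shares and threshold below are placeholders; real targets would come from a reference corpus or a phonotactic study of the language:

```python
def under_represented(counts, target, threshold=0.5):
    """Flag units whose corpus share falls below threshold * target share.

    counts: observed unit -> count (e.g. phoneme counts from the corpus).
    target: unit -> expected relative frequency (shares summing to 1).
    Both the target shares and the 0.5 threshold are assumptions.
    """
    total = sum(counts.values()) or 1
    flagged = []
    for unit, expected in target.items():
        observed = counts.get(unit, 0) / total
        if observed < threshold * expected:
            flagged.append(unit)
    return flagged

# Hypothetical numbers: /dʒ/ is scarce relative to its target share.
counts = {"/θ/": 8, "/dʒ/": 1, "/tʃ/": 3}
target = {"/θ/": 0.3, "/dʒ/": 0.3, "/tʃ/": 0.3}
print(under_represented(counts, target))
```

The flagged units would then feed directly into sentence creation, much as the Harvard Sentences were designed to be phonetically balanced.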