Word frequency lists

There seems to be a consensus that meaningless sentences are not an issue (apart from the discomfort they cause the reader).
This suggests that the intended Speech-to-Text programs would not use sentence coherence, only the phonetics of the words.

In this case, why not use simple lists of words,
such as lists arranged by frequency of use in each language?
(For French, the Ministry of Education publishes such lists:

I suspect that reading individual words is less fun (or not fun at all) than reading sentences. It would also take longer to gather many hours of voice, since right now each sentence clip gives us 4-8s of audio.

Also, some words that come up in everyday speech recognition (brand names, for example) are not present in frequent-word lists.

The goal is to get natural speech, which is easier with complete sentences.

Also, if Common Voice can assemble a high-quality sentence dataset, it could potentially be used for sentence coherence in future, or for a range of other purposes. I tend to view CV as two separate datasets: text and voice.

Well, I do agree with both your answers, but are we sure that the sentences we gather cover all the so-called frequent words?

In other words, can we rely on randomness to properly sample a language?

No, but the way to increase the proportion of missing or low-frequency words is to search a large corpus of natural sentences - e.g. public domain books - and pull out the sentences that include the needed words.
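As a rough sketch of that idea (the function name, the naive sentence splitting, and the tokenization are all my own assumptions, not anything from an actual Common Voice tool):

```python
import re

def sentences_with_words(corpus_text, needed_words):
    """Pull sentences out of a corpus that contain any of the needed words.

    A rough sketch only: splits sentences on terminal punctuation and
    matches whole lowercase tokens. A real script would need proper
    tokenization and language-specific handling.
    """
    needed = {w.lower() for w in needed_words}
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", corpus_text)
    hits = []
    for s in sentences:
        tokens = set(re.findall(r"[a-zA-Z']+", s.lower()))
        if tokens & needed:  # sentence contains at least one needed word
            hits.append(s.strip())
    return hits

book = "The brand was new. Nobody knew it. It sold well anyway."
print(sentences_with_words(book, ["brand", "sold"]))
# → ['The brand was new.', 'It sold well anyway.']
```

Each matched sentence could then be reviewed and submitted through the usual sentence-collection process.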

I have a script to locate words with little-to-no coverage. You can then submit sentences targeting those specific words.
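The actual script isn't shown in the thread, but a minimal version of that kind of coverage check might look like this (the function name, the threshold parameter, and the per-sentence counting are my assumptions):

```python
import re
from collections import Counter

def low_coverage_words(frequency_list, sentences, threshold=1):
    """Return frequency-list words that appear in at most `threshold`
    collected sentences.

    A guess at what such a coverage script might do: count, for each
    word, how many sentences contain it, then flag the rare ones.
    """
    counts = Counter()
    for s in sentences:
        # Count each word at most once per sentence.
        for w in set(re.findall(r"[a-z']+", s.lower())):
            counts[w] += 1
    return [w for w in frequency_list if counts[w.lower()] <= threshold]

freq = ["the", "house", "water", "zebra"]
collected = ["The house is big.", "The water is cold.", "The house is old."]
print(low_coverage_words(freq, collected, threshold=1))
# → ['water', 'zebra']
```

The returned list could then feed directly into the sentence search described above: for each flagged word, look for natural sentences that contain it.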
