Contributing many sentences that contain not yet spoken words

bozden · June 13, 2023, 7:16am

Your text corpus has many sentences, but not all of them have been recorded yet. An important issue is that many of them turn out to be duplicates once you normalize them.

18.7% is a high value IMHO.

validated.tsv contains only 53,772 distinct sentences, so only a small portion of your text corpus has been covered (20.8% of the normalized unique sentences).
The total number of (normalized) tokens in the whole text corpus is 48,009; you probably got your 28,840 from the validated.tsv file.
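
For anyone who wants to check numbers like these on their own data, here is a rough sketch of how they could be computed. It is not the exact script behind the figures above: the `normalize` rule (lowercase, strip punctuation, collapse whitespace) and the helper name `corpus_stats` are assumptions made for illustration, and "tokens" is taken to mean distinct words.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Assumed normalization: NFC, lowercase, punctuation removed, spaces collapsed."""
    text = unicodedata.normalize("NFC", sentence).lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def corpus_stats(corpus, recorded):
    """Compare a text corpus with the sentences that already have recordings.

    corpus   -- iterable of all sentences in the text corpus
    recorded -- iterable of sentences that have been recorded
    """
    # Normalize and drop sentences that become empty (e.g. punctuation only).
    norm_corpus = [n for n in (normalize(s) for s in corpus) if n]
    if not norm_corpus:
        raise ValueError("corpus has no usable sentences")

    unique_corpus = set(norm_corpus)
    # Only count recorded sentences that are actually part of the corpus.
    unique_recorded = {normalize(s) for s in recorded} & unique_corpus

    corpus_tokens = {tok for s in unique_corpus for tok in s.split()}
    recorded_tokens = {tok for s in unique_recorded for tok in s.split()}

    return {
        # Share of corpus sentences that repeat an earlier one after normalization.
        "duplicate_ratio": (len(norm_corpus) - len(unique_corpus)) / len(norm_corpus),
        # Share of normalized unique sentences that have been recorded.
        "coverage": len(unique_recorded) / len(unique_corpus),
        "corpus_tokens": len(corpus_tokens),
        "recorded_tokens": len(recorded_tokens),
    }


if __name__ == "__main__":
    corpus = ["Hello, world!", "hello world", "Good morning.", "See you soon"]
    recorded = ["Hello world", "Unrelated sentence"]
    print(corpus_stats(corpus, recorded))
```

In practice `recorded` would be filled from the sentence column of validated.tsv and `corpus` from your sentence files. The result depends heavily on the normalization you choose, so treat the rule above as a starting point only.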
