At first, I thought the same way you do, but my models say otherwise, as I explained before.
It might seem a waste of volunteer time for now, but please be aware that dataset-building projects like Common Voice have long time spans, measured in years.
Accuracy in machine learning models improves only with exponentially more data: each fixed gain requires far more additional data than the last. Just to give some hypothetical numbers: you can drop WER from 50% to 40% with +100h of recordings, but to go from 20% to 10% you might need +1000h, and much more again to go from 5% to 3%.
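To make the diminishing-returns idea concrete, here is a small sketch assuming WER follows a power law, wer ≈ a · hours^(−b), which is a common empirical model for ASR scaling. The constants `a` and `b` below are made up for illustration (chosen so that halving WER costs roughly 10× the data), not measured from Common Voice:

```python
# Illustrative only: hypothetical power-law scaling of WER with data size.
# wer = a * hours^(-b)  =>  hours = (a / wer)^(1 / b)
# With b ≈ 0.3, halving the WER requires roughly 10x the training hours.

def hours_needed(target_wer: float, a: float = 2.0, b: float = 0.3) -> float:
    """Hours of speech needed to reach target_wer under the assumed power law."""
    return (a / target_wer) ** (1.0 / b)

for wer in (0.5, 0.4, 0.2, 0.1, 0.05, 0.03):
    print(f"WER {wer:.0%}: ~{hours_needed(wer):,.0f} h of recordings")
```

The exact numbers are not the point; the shape of the curve is. Every further improvement costs an order of magnitude more recordings, which is why a long-running collection effort matters.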
Georgian has ~4M native speakers, and you have 1,254 distinct (?) voices, which is a good number (at least much better than our sample size for Turkish). You probably won't be able to reach 1% of the population (40,000) through campaigns etc., so you will grow that number gradually; meanwhile, having those existing contributors record more will matter more. As I explained previously, more data is better.
One can easily use 5k recordings from Nemo in training. In a couple of years, many people in the community will also reach 10k+, so her recordings will be used even more; they will not be wasted (except bandwidth, perhaps). She will probably quit at some point, but her contribution will live on…
We will never be able to get 1M different people recording diverse sentences (the ideal case). So we need 1,000–2,000 people each recording thousands of sentences, while we keep trying to enlarge the voice diversity.