Why are the first 10 words the same?

I have heard from developers who use the Georgian dataset that many clips are useless, because they are recordings of the same 10 words again and again.

Actually, it’s more than 50% of the current Georgian dataset.

Why not use different words instead of repeating the same 10? Or why use single words at all?

Hey @Razmik-Badalyan, let me try to understand and explain… You are saying that some single words got recorded many times, right? If that is what we are talking about, the statistics you shared above have nothing to do with it.

I don’t know which “single words” you are talking about, but I think it is safe to assume they come from the “Single Word Sprint” that was done in 2020, containing the words “zero” to “nine”, “yes”, “no”, “hey” and “Firefox”. In some languages hundreds of people recorded these (as it should be). When you train an AI that understands simple numbers (e.g. a phone number spelled as digits), you need them recorded by hundreds of different voices. Here is some info on this: https://community.mozilla.org/en/campaigns/common-voice-single-word-sprint-2/

These are labeled as “benchmark” under the “segment” column in the dataset metadata files. Here is a snapshot from Turkish validated; see the word “Benchmark”:

[screenshot: Turkish validated.tsv metadata with the “segment” column showing “Benchmark”]
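If you want to check this in your own download, something like this pandas sketch should show the counts (the path is a placeholder and I’m assuming the usual metadata column names - adjust to your release):

```python
import pandas as pd

# "cv-corpus-xx/ka" is a placeholder path to a downloaded Georgian release.
df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")

# The Single Word Sprint clips carry a segment label such as "Benchmark".
print(df["segment"].fillna("<none>").value_counts())
```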

To see this, you should look at the “sentences” tab in the Common Voice Dataset Analyzer.

So, more than 100 people recorded 12 such single words, resulting in >1200 recordings (out of your ~85k clips). If you don’t want them, just limit them in code (e.g. using Corpora Creator with the -s 5 parameter). Also, only taking longer recordings (e.g. >1500 ms) would eliminate many single words, I think - if you don’t want these… I wouldn’t know whether they would create a transcription bias without extensive testing, though.
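Here is a rough pandas sketch of those filters - dropping the benchmark segment, capping recordings per sentence like -s 5 is meant to do, and removing very short clips. The paths, the clip_durations.tsv column names and the cut-off values are my assumptions, so please verify them against your release:

```python
import pandas as pd

df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")
# clip_durations.tsv ships with recent releases; I'm assuming its columns
# are "clip" and "duration[ms]" - check your download.
durs = pd.read_csv("cv-corpus-xx/ka/clip_durations.tsv", sep="\t")
df = df.merge(durs, left_on="path", right_on="clip", how="left")

# 1) Drop the Single Word Sprint clips via the segment label.
df = df[df["segment"].fillna("") != "Benchmark"]

# 2) Keep at most 5 recordings per sentence, roughly what Corpora Creator's
#    -s 5 aims for (shuffle first so the kept clips are not biased toward
#    whoever recorded a sentence earliest).
df = df.sample(frac=1, random_state=42).groupby("sentence", sort=False).head(5)

# 3) Drop very short recordings (< 1500 ms), which removes most remaining
#    single-word clips.
df = df[df["duration[ms]"] >= 1500]

print(len(df), "clips left after filtering")
```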

As these are common words in any language, I would say no… I also think single-word questions/answers are part of our conversations, but as a general rule you want to train a model with data that matches your specific purpose. E.g. if you are transcribing a conference (i.e. non-conversational, with long sentences) you might want to remove these, as newer models work better with recordings of 5-25 seconds.

If someone is dumping a dictionary (which doesn’t look like the case here), that would be bad of course. If you look at the “text-corpus” tab, you will see there are only 16 single-word sentences, most of them from the benchmark (disclaimer: these values are not exact after v13.0, because we cannot analyze sentences entered through the new web interface, as they are not released yet - but I know the team is working on it).

But since Common Voice is a generic dataset intended for many purposes, I think these single words / shorter sentences will be very valuable, for example for simpler limited-vocabulary / appliance-command models.

One other thing: it is actually the duration of the training set that matters! If you have (say) 1,800 of these (say) 1-second recordings, that would total only half an hour, while you have 120+ hours in validated…
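A quick way to see how little these short clips weigh in terms of duration (again assuming clip_durations.tsv with a "duration[ms]" column - check your release):

```python
import pandas as pd

df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")
durs = pd.read_csv("cv-corpus-xx/ka/clip_durations.tsv", sep="\t")
df = df.merge(durs, left_on="path", right_on="clip", how="left")

# Convert milliseconds to hours and compare the short clips' share
# against the whole validated split.
total_h = df["duration[ms]"].sum() / 3_600_000
short_h = df.loc[df["duration[ms]"] < 1500, "duration[ms]"].sum() / 3_600_000

# e.g. 1,800 clips of ~1 s each are only ~0.5 h, negligible next to 120+ h.
print(f"validated total: {total_h:.1f} h, clips under 1.5 s: {short_h:.2f} h "
      f"({100 * short_h / total_h:.1f}% of the duration)")
```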

As for the Recordings per Voice stats under the Voices tab - which is why the developers are leaving out 50% of the dataset:

You mentioned in another discussion that some people record too much and those extra recordings were not included in the training to prevent voice bias. It really depends on your application, your model architecture (e.g. Whisper) and your workflow (e.g. fine-tuning vs training from scratch). I cannot say where to put the threshold without actually creating a model and testing it against the same test set - which must be diverse. 15? 100? 300? If at most 15 recordings per person are taken, you would lose many recordings. Again, it is not possible to deduce the number without testing. In my experiments for Turkish, fine-tuning from the multilingual Whisper models generally performs best if I take the whole dataset split with my v1 algorithm, but others might not agree (different language/model/methodology etc.).
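If you want to experiment with such a cap yourself, a minimal sketch could look like this (the value of 100 is purely illustrative - only testing against a diverse test set can tell you a good threshold):

```python
import pandas as pd

df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")

MAX_PER_VOICE = 100  # illustrative only; tune by testing

# Shuffle, then keep at most N clips per speaker (client_id) to limit
# the influence of very prolific voices on the training set.
capped = (
    df.sample(frac=1, random_state=42)
      .groupby("client_id", sort=False)
      .head(MAX_PER_VOICE)
)

print(len(df), "->", len(capped), "clips after capping per voice")
```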

Hi @bozden, thank you for your answer and sorry for the late reply.

Now that I look at it again, I see that I misinterpreted the graph. The % is of the voices/volunteers, not of the clips.

Oh, that’s it then :slight_smile: The % in frequency graphs shows how much of the population (in this case, voices) falls in that bucket.