Why are the first 10 words the same?

I have heard from developers who use the Georgian dataset that many clips are useless, because they are recordings of the same 10 words again and again.

Actually, it’s more than 50% of the current Georgian dataset.

Why not use different words instead of repeating the same 10? Or why use single words at all?

Hey @Razmik-Badalyan, let me try to understand and explain… You are saying that some single words got recorded many times, right? If that is what we are talking about, the statistics you shared above have nothing to do with it.

I don’t know which “single words” you are talking about, but I think it is safe to assume they come from the “Single Word Sprint” that was done in 2020, containing the words “zero” to “nine”, “yes”, “no”, “hey” and “Firefox”. In some languages hundreds of people recorded these (as it should be). When you train an AI that understands simple numbers (e.g. a phone number spelled as digits), you need them recorded by hundreds of different voices. Here is some info on this: https://community.mozilla.org/en/campaigns/common-voice-single-word-sprint-2/

These are labeled as “benchmark” under the “segment” column in the dataset metadata files. Here is a snapshot from Turkish validated; see the word “Benchmark”:

[screenshot: Turkish validated.tsv metadata with the “segment” column showing “Benchmark”]
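If you want to check this in your own download, something like this pandas sketch should show the counts (the path is a placeholder and I’m assuming the usual metadata column names - adjust to your release):

```python
import pandas as pd

# "cv-corpus-xx/ka" is a placeholder path to a downloaded Georgian release.
df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")

# The Single Word Sprint clips carry a segment label such as "Benchmark".
print(df["segment"].fillna("<none>").value_counts())
```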

To see this, you should look at the “sentences” tab in the Common Voice Dataset Analyzer.

So, more than 100 people recorded 12 such single words, resulting in >1200 recordings (out of your ~85k clips). If you don’t want them, just limit them in code (e.g. using Corpora Creator with the -s 5 parameter). Also, only taking longer recordings (e.g. >1500 ms) would eliminate many single words, I think - if you don’t want these… I wouldn’t know whether they would create a transcription bias without extensive testing, though.
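Here is a rough pandas sketch of those filters - dropping the benchmark segment, capping recordings per sentence like -s 5 is meant to do, and removing very short clips. The paths, the clip_durations.tsv column names and the cut-off values are my assumptions, so please verify them against your release:

```python
import pandas as pd

df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")
# clip_durations.tsv ships with recent releases; I'm assuming its columns
# are "clip" and "duration[ms]" - check your download.
durs = pd.read_csv("cv-corpus-xx/ka/clip_durations.tsv", sep="\t")
df = df.merge(durs, left_on="path", right_on="clip", how="left")

# 1) Drop the Single Word Sprint clips via the segment label.
df = df[df["segment"].fillna("") != "Benchmark"]

# 2) Keep at most 5 recordings per sentence, roughly what Corpora Creator's
#    -s 5 aims for (shuffle first so the kept clips are not biased toward
#    whoever recorded a sentence earliest).
df = df.sample(frac=1, random_state=42).groupby("sentence", sort=False).head(5)

# 3) Drop very short recordings (< 1500 ms), which removes most remaining
#    single-word clips.
df = df[df["duration[ms]"] >= 1500]

print(len(df), "clips left after filtering")
```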

As these are common words in any language, I would say no… I also think single-word questions/answers are part of our conversations, but as a general rule you want to train a model with data that matches your specific purpose. E.g. if you are transcribing a conference (i.e. non-conversational, with long sentences) you might want to remove these, as newer models work better with recordings of 5-25 seconds.

If someone is dumping a dictionary (which doesn’t look like the case here), that would be bad of course. If you look at the “text-corpus” tab, you will see there are only 16 single-word sentences, most of them from the benchmark (disclaimer: these values are not exact after v13.0, because we cannot analyze sentences entered through the new web interface, as they are not released yet - but I know the team is working on it).

But since Common Voice is a generic dataset intended for many purposes, I think these single words / shorter sentences will be very valuable, for example for simpler limited-vocabulary / appliance-command models.

One other thing: it is actually the duration of the training set that matters! If you have (say) 1,800 of these (say) 1-second recordings, that would total only half an hour, while you have 120+ hours in validated…
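A quick way to see how little these short clips weigh in terms of duration (again assuming clip_durations.tsv with a "duration[ms]" column - check your release):

```python
import pandas as pd

df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")
durs = pd.read_csv("cv-corpus-xx/ka/clip_durations.tsv", sep="\t")
df = df.merge(durs, left_on="path", right_on="clip", how="left")

# Convert milliseconds to hours and compare the short clips' share
# against the whole validated split.
total_h = df["duration[ms]"].sum() / 3_600_000
short_h = df.loc[df["duration[ms]"] < 1500, "duration[ms]"].sum() / 3_600_000

# e.g. 1,800 clips of ~1 s each are only ~0.5 h, negligible next to 120+ h.
print(f"validated total: {total_h:.1f} h, clips under 1.5 s: {short_h:.2f} h "
      f"({100 * short_h / total_h:.1f}% of the duration)")
```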

As for the Recordings per Voice stats under the Voices tab - which is why the developers are leaving out 50% of the dataset:

You mentioned in another discussion that some people record too much and those extra recordings were not included in the training to prevent voice bias. It really depends on your application, your model architecture (e.g. Whisper) and your workflow (e.g. fine-tuning vs training from scratch). I cannot say where to put the threshold without actually creating a model and testing it against the same test set - which must be diverse. 15? 100? 300? If at most 15 recordings per person are taken, you would lose many recordings. Again, it is not possible to deduce the number without testing. In my experiments for Turkish, fine-tuning from the multilingual Whisper models generally performs best if I take the whole dataset split with my v1 algorithm, but others might not agree (different language/model/methodology etc.).
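If you want to experiment with such a cap yourself, a minimal sketch could look like this (the value of 100 is purely illustrative - only testing against a diverse test set can tell you a good threshold):

```python
import pandas as pd

df = pd.read_csv("cv-corpus-xx/ka/validated.tsv", sep="\t")

MAX_PER_VOICE = 100  # illustrative only; tune by testing

# Shuffle, then keep at most N clips per speaker (client_id) to limit
# the influence of very prolific voices on the training set.
capped = (
    df.sample(frac=1, random_state=42)
      .groupby("client_id", sort=False)
      .head(MAX_PER_VOICE)
)

print(len(df), "->", len(capped), "clips after capping per voice")
```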

Hi @bozden, thank you for your answer and sorry for the late reply.

Now that I look at it again, I see that I misinterpreted the graph. The % is of the voices/volunteers, not of the clips.

Oh, that’s it then :slight_smile: The % in frequency graphs shows how much of the population (in this case, voices) falls in that bucket.