Sentence analysis for the main languages - action needed for the ones with a deficit

Hi everyone,

I’ve been running some numbers for the 15 locales with the most validated hours today, calculating how many sentences and how many validated hours each of them has.

Based on this, we can estimate how many sentences each locale needs to cover its current hours without any repetition (one recording per sentence).

Sentences difference: negative numbers indicate that we need at least that many more sentences to cover the current hours (and potentially more, to collect additional clips without repetitions).

Additional hrs we could accommodate: negative numbers indicate that we already have that many hours of repeated sentences, which won’t be used for Deep Speech training.

| Locale | Current hours | Current sentences | Sentences to cover current hrs | Sentences difference | Additional hrs we could accommodate |
| --- | --- | --- | --- | --- | --- |
| English | 880 | 1392395 | 633600 | 758795 | 1053.88 |
| German | 390 | 1412583 | 280800 | 1131783 | 1571.92 |
| French | 218 | 2130572 | 156960 | 1973612 | 2741.13 |
| Spanish | 44 | 1178931 | 31680 | 1147251 | 1593.40 |
| Chinese (China) | 14 | 53164 | 10080 | 43084 | 59.84 |
| Kabyle | 260 | 35715 | 187200 | -151485 | -210.40 |
| Catalan | 140 | 33622 | 100800 | -67178 | -93.30 |
| Persian | 87 | 6005 | 62640 | -56635 | -78.66 |
| Chinese (Taiwan) | 54 | 4827 | 38880 | -34053 | -47.30 |
| Welsh | 54 | 1470 | 38880 | -37410 | -51.96 |
| Basque | 45 | 6262 | 32400 | -26138 | -36.30 |
| Russian | 38 | 10787 | 27360 | -16573 | -23.02 |
| Italian | 36 | 12283 | 25920 | -13637 | -18.94 |
| Tatar | 28 | 17814 | 20160 | -2346 | -3.26 |
| Dutch | 24 | 5249 | 17280 | -12031 | -16.71 |
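
For reference, the columns above follow from a simple assumption that the numbers imply: an average clip length of about 5 seconds, i.e. roughly 720 clips per validated hour (for example, 880 h × 720 = 633600 sentences for English). Here is a minimal sketch of the calculation under that assumption:

```python
# Rough reconstruction of the table's math, assuming an average clip length of
# about 5 seconds, i.e. ~720 clips per validated hour (inferred from the table).
CLIPS_PER_HOUR = 720

locales = {
    # locale: (current validated hours, current sentences)
    "English": (880, 1_392_395),
    "Kabyle": (260, 35_715),
    "Catalan": (140, 33_622),
}

for name, (hours, sentences) in locales.items():
    needed = hours * CLIPS_PER_HOUR            # sentences to cover current hours
    difference = sentences - needed            # negative = sentence deficit
    extra_hours = difference / CLIPS_PER_HOUR  # negative = hours of repeated sentences
    print(f"{name}: sentences needed={needed}, difference={difference}, "
          f"additional hours={extra_hours:.2f}")
```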

If you are in a locale with a negative difference, I would encourage you to prioritize mobilizing your communities and networks to get more sentences, either through the Sentence Collector or by getting some technical people to review our recently published script to mass-extract sentences from Wikipedia.

We need to avoid people recording the same sentences over and over again.

Thanks everyone for your support! :slight_smile:


Update: Edited to prevent any notion that duplicate recordings are “not useful”; that’s not at all the case.

Deep Speech is in large part still a research project. The team is constantly learning and optimising what the “golden standard” training dataset should look like, both for our own engine and to cater to the needs of the broader research community.

The more clarity we gain, the better we can design and further develop data collection via Common Voice.

To be very clear: all recorded and validated hours are valuable and will be included in the Common Voice dataset. We just want to incorporate the feedback we’ve received, and have therefore been putting even more emphasis on sentence diversity and volume through tool adjustments, new approaches and calls to action like the one above.

We keep learning :slight_smile:



It’s worth mentioning that a large proportion of the 880 hours of English are duplicate sentences recorded prior to Sentence Collector / Wikipedia import. So English can probably accommodate a lot more hours than calculated.

You’re right, and I’ve asked for this to be checked as well, but it’s not as urgent as for other languages, since we already have enough sentences in English to keep accommodating recordings.

Cheers.

@nukeador, quick question here

I’m part of the Catalan community, and we would have loved to know (earlier) that out of the 140 hours we’ve recorded, 93 hours won’t be usable because they are duplicates… So now we know we need more sentences.

But why does Common Voice propose sentences that already have recordings? Does that also happen even if “non-used” sentences exist?

Meaning: will English have the same problem, where some of the current hours (880) may not be useful because they are duplicates?

We haven’t put a hard lock on recordings on the site because some communities asked to be able to keep collecting voices for other uses (not Deep Speech), and also it was only recently (late last year) that we got more clarity on the number of recordings per sentence needed to train models.

That’s why we have always been pushing for communities to get more sentences (the Sentence Collector and the Wikipedia extraction work). Recordings of the same sentence are not lost (they are part of the dataset); it’s just that they are not as useful as recordings of new sentences.

We are going to be working with the Deep Speech team to fully understand this and propose the necessary changes that help their goals, as well as balance that with what communities are asking for.

What do you mean by “non-used” here? For the languages with a deficit, all sentences have already been recorded at least once.

Cheers.

By “non-used” I meant sentences without recordings.

Rephrasing my question: is it possible that in languages like English some sentences have more than one recording while others have zero?

Oh, absolutely. This is definitely the case with English. Common Voice operated with a limited text corpus for over a year before Sentence Collector came along. So there’s a high chance that anything prior to the beginning of 2019 (when SC launched) has a significant number of duplicates.

The latest dataset offers 770 hours of English, but once you filter out the duplicates you end up with a tiny fraction of that. I can’t remember the exact amount, but I think it was in the ballpark of 130 hrs, or 17%. That’s a pretty bad ratio, but since then another 100 hrs or so of Sentence Collector / wiki sentences have been recorded, which should be reasonably low on duplicates, so that percentage should improve in the next dataset release.
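
For anyone who wants to measure this themselves, here is a minimal sketch of such a duplicate filter, assuming you start from a dataset release’s validated.tsv with a `sentence` column (the exact file and column names may vary between releases):

```python
import csv

# Minimal sketch: keep only the first clip per sentence from a Common Voice
# release file (assumes a tab-separated validated.tsv with a "sentence" column).
seen = set()
unique_rows = []
with open("validated.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        if row["sentence"] not in seen:
            seen.add(row["sentence"])
            unique_rows.append(row)

print(f"{len(unique_rows)} clips with unique sentences kept")
```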

That was my point here: shouldn’t the Common Voice web app make it zero, instead of just low?

IMHO, if Deep Speech can only use one recording per sentence, that’s what Common Voice should try to get, and it should only offer an already recorded sentence once all sentences have been recorded (and validated) at least once.
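
To make the suggested policy concrete, here is a purely hypothetical sketch (not how the Common Voice site is actually implemented): serve unrecorded sentences first, and only fall back to already-recorded ones once every sentence has at least one recording.

```python
import random

# Hypothetical sketch of the proposed policy: serve unrecorded sentences first,
# and only re-serve recorded ones once every sentence has at least one recording.
def pick_sentence(recording_counts):
    """recording_counts: dict mapping sentence -> number of recordings so far."""
    unrecorded = [s for s, n in recording_counts.items() if n == 0]
    pool = unrecorded if unrecorded else list(recording_counts)
    return random.choice(pool)

counts = {"sentence A": 2, "sentence B": 0, "sentence C": 1}
print(pick_sentence(counts))  # always "sentence B" until it gets a recording
```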


It should ideally be zero but that’s probably pretty difficult to achieve. There will always be a margin of error due to the potential for race conditions - two people recording the same sentence at the same time, someone recording a sentence after leaving their browser open for ages, etc.
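
A tiny illustration of that race, again purely hypothetical and not Common Voice’s actual serving code: two clients are handed the same least-recorded sentence before either submission lands, so a duplicate appears even under a “serve the least-recorded sentence” policy.

```python
# Hypothetical illustration of the race: both clients are served the same
# "least-recorded" sentence before either recording is submitted.
counts = {"sentence A": 0, "sentence B": 1}

def serve(counts):
    # pick the sentence with the fewest recordings so far
    return min(counts, key=counts.get)

client_1 = serve(counts)   # "sentence A"
client_2 = serve(counts)   # also "sentence A": the count hasn't been updated yet
counts[client_1] += 1
counts[client_2] += 1
print(counts)              # {'sentence A': 2, 'sentence B': 1} -> a duplicate
```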

Plus there’s this issue:

I’ve edited the first post to prevent any notion that duplicate recordings are “not useful”; that’s not at all the case.

Deep Speech is in large part still a research project. The team is constantly learning and optimising what the “golden standard” training dataset should look like, both for our own engine and to cater to the needs of the broader research community.

The more clarity we gain, the better we can design and further develop data collection via Common Voice.

To be very clear: all recorded and validated hours are valuable and will be included in the Common Voice dataset. We just want to incorporate the feedback we’ve received, and have therefore been putting even more emphasis on sentence diversity and volume through tool adjustments, new approaches and calls to action like the one above.

We keep learning :slight_smile:


Hi, for Kabyle I have more than 270,000 sentences. I’m looking for kab contributors to help me check them before sending them to the Sentence Collector.


Great. I would advise checking them using the Sentence Collector itself, so you don’t have to do duplicate work.

Cheers.


OK, I’ll upload them. Is there any way to upload them quickly? I split the corpus into files of 5,000 sentences (55 files).
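
In case it’s useful to anyone preparing a corpus the same way, here is a small sketch of that kind of split (one sentence per line; the file names are just placeholders):

```python
# Sketch: split a one-sentence-per-line corpus into chunks of 5,000 lines.
# The file names (kab_corpus.txt, kab_part_XX.txt) are placeholders.
CHUNK = 5000

with open("kab_corpus.txt", encoding="utf-8") as f:
    lines = f.readlines()

for i in range(0, len(lines), CHUNK):
    with open(f"kab_part_{i // CHUNK:02d}.txt", "w", encoding="utf-8") as out:
        out.writelines(lines[i:i + CHUNK])
```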


I don’t know if there is a limit on sentence submissions; you can copy and paste them into the tool. @mkohler can tell you better.

What I do know is that only 10K sentences are presented for review at a time.


Yes, I noticed that only 10K sentences are presented. We can’t see how many sentences are queued.
