nukeador (Rubén Martín):
Hi everyone,
I’ve been running some numbers on the 15 locales with the most validated hours today, calculating how many sentences and how many validated hours each of them has.
Based on this, we can estimate how many sentences are needed to cover the current hours without any repetition (one recording per sentence). A small sketch of the calculation follows the table.
Sentences difference: negative numbers indicate that we need at least that many more sentences to cover the current hours (and potentially even more to get additional clips without repetitions).
Additional hrs we could accommodate: negative numbers indicate that we already have that many hours of repeated sentences, which won’t be used for Deep Speech training.
| Locale | Current hours | Current sentences | Sentences to cover current hrs | Sentences difference | Additional hrs we could accommodate |
| --- | ---: | ---: | ---: | ---: | ---: |
| English | 880 | 1392395 | 633600 | 758795 | 1053.88 |
| German | 390 | 1412583 | 280800 | 1131783 | 1571.92 |
| French | 218 | 2130572 | 156960 | 1973612 | 2741.13 |
| Spanish | 44 | 1178931 | 31680 | 1147251 | 1593.40 |
| Chinese (China) | 14 | 53164 | 10080 | 43084 | 59.84 |
| Kabyle | 260 | 35715 | 187200 | -151485 | -210.40 |
| Catalan | 140 | 33622 | 100800 | -67178 | -93.30 |
| Persian | 87 | 6005 | 62640 | -56635 | -78.66 |
| Chinese (Taiwan) | 54 | 4827 | 38880 | -34053 | -47.30 |
| Welsh | 54 | 1470 | 38880 | -37410 | -51.96 |
| Basque | 45 | 6262 | 32400 | -26138 | -36.30 |
| Russian | 38 | 10787 | 27360 | -16573 | -23.02 |
| Italian | 36 | 12283 | 25920 | -13637 | -18.94 |
| Tatar | 28 | 17814 | 20160 | -2346 | -3.26 |
| Dutch | 24 | 5249 | 17280 | -12031 | -16.71 |
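For reference, here is a minimal sketch (in Python) of the arithmetic behind the table. It assumes the ratio the table itself implies, roughly 720 recordings per validated hour (an average clip of about 5 seconds); the two locales in the example are just rows copied from above.

```python
# Sketch of the table's arithmetic, assuming ~720 recordings per validated hour
# (about 5 seconds per clip) -- the ratio implied by the table itself
# (e.g. 880 h -> 633,600 sentences).
SENTENCES_PER_HOUR = 3600 / 5  # = 720

def sentence_balance(current_hours, current_sentences):
    """Return (sentences needed, sentence difference, additional hours)."""
    needed = current_hours * SENTENCES_PER_HOUR
    difference = current_sentences - needed
    additional_hours = difference / SENTENCES_PER_HOUR
    return needed, difference, additional_hours

for locale, hours, sentences in [("English", 880, 1_392_395),
                                 ("Catalan", 140, 33_622)]:
    needed, diff, extra = sentence_balance(hours, sentences)
    print(f"{locale}: needs {needed:,.0f} sentences, "
          f"difference {diff:,.0f}, additional hours {extra:,.2f}")
```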
If you are in a locale with a negative difference, I would encourage you to prioritize mobilizing your communities and networks to get more sentences, either through the Sentence Collector or by getting some technical people to review our recently published script to mass-extract sentences from Wikipedia.
We need to avoid people recording the same sentences over and over again.
Thanks everyone for your support!
Update: Edited to prevent any notion that duplicate recordings are “not useful”; that’s not at all the case.
Deep Speech is still in large part a research project. The team is constantly learning and optimizing what the “gold standard” training dataset looks like, both for our own engine and to cater to the needs of the broader research community.
The more clarity we gain, the better we can design and further develop data collection via Common Voice.
To be very clear: all recorded and validated hours are valuable and will be included in the Common Voice dataset. We just want to incorporate feedback we’ve gotten, and have thus been putting even more emphasis on sentence diversity and volume through tool adjustments, new approaches, and calls to action like the one above.
It’s worth mentioning that a large proportion of the 880 hours of English consists of duplicate sentences recorded prior to the Sentence Collector / Wikipedia import, so English can probably accommodate a lot more hours than calculated.
nukeador (Rubén Martín):
You are right, and I’ve asked for this to be checked as well, but it’s not as urgent as for other languages, since we already have enough sentences in English to keep accommodating recordings.
I’m part of the Catalan community, and we would have loved to know (earlier) that, out of the 140 hours we’ve recorded, 93 hours can’t be used because they are duplicates… So now we know we need more sentences.
But why does Common Voice propose sentences that already have recordings? Does that also happen even when “non-used” sentences exist?
Meaning: will English have the same problem, where some of the current 880 hours may not be useful because they are duplicates?
nukeador (Rubén Martín):
We haven’t put a hard lock on recordings on the site, because some communities asked to be able to keep collecting voices for other uses (not Deep Speech), and it’s only recently (late last year) that we got more clarity on the number of recordings per sentence needed to train models.
That’s why we have always been pushing for communities to get more sentences (the Sentence Collector and the Wikipedia extraction work). Recordings of the same sentence are not lost (they are part of the dataset); it’s just that they are not as useful as recordings of new sentences.
We are going to be working with the Deep Speech team to fully understand this and propose the changes needed to support their goals while balancing that with what communities are asking for.
What do you mean by “non-used” here? For the languages with a deficit, all sentences have already been recorded at least once.
By “non-used” I meant sentences without recordings.
Rephrasing my question: is it possible that in languages like English some sentences have more than one recording while others have zero?
Oh, absolutely. This is definitely the case with English. Common Voice operated with a limited text corpus for over a year before Sentence Collector came along. So there’s a high chance that anything prior to the beginning of 2019 (when SC launched) has a significant number of duplicates.
The latest dataset offers 770 hours of English but once you filter out the duplicates you end up with a tiny fraction of that. I can’t remember the exact amount but I think it was in the ballpark of 130 hrs or 17%. That’s a pretty bad ratio, but there’s since been another 100 hrs or so of Sentence Collector / wiki sentences recorded, which should be reasonably low on duplicates, so that percentage should improve in the next dataset release.
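If anyone wants to check this for their own locale, a rough, unofficial sketch is below. It assumes the release’s validated.tsv layout (tab-separated, with a sentence column, as in the Common Voice downloads) and, since clip durations aren’t listed in the TSV, approximates them with an assumed 5-second average.

```python
# Rough, unofficial estimate of how much of a locale's validated data is
# first-time sentences vs. repeats. Assumes validated.tsv has a "sentence"
# column and that clips average ~5 seconds (both assumptions, not guarantees).
import pandas as pd

AVG_CLIP_SECONDS = 5  # assumed average clip duration

df = pd.read_csv("en/validated.tsv", sep="\t")

total_clips = len(df)
unique_sentences = df["sentence"].nunique()
duplicate_clips = total_clips - unique_sentences

total_hours = total_clips * AVG_CLIP_SECONDS / 3600
unique_hours = unique_sentences * AVG_CLIP_SECONDS / 3600

print(f"total:      {total_clips} clips (~{total_hours:.0f} h)")
print(f"unique:     {unique_sentences} sentences (~{unique_hours:.0f} h)")
print(f"duplicates: {duplicate_clips} clips "
      f"({duplicate_clips / total_clips:.0%} of all clips)")
```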
That was my point here: shouldn’t the Common Voice web app make it zero instead of just low?
IMHO, if Deep Speech can only use one recording per sentence, that’s what Common Voice should try to get, and it should only offer an already-recorded sentence once all sentences have been recorded (and validated) at least once.
It should ideally be zero but that’s probably pretty difficult to achieve. There will always be a margin of error due to the potential for race conditions - two people recording the same sentence at the same time, someone recording a sentence after leaving their browser open for ages, etc.
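Purely as an illustration (this is not Common Voice’s actual implementation), the policy being asked for could look roughly like this: always serve the least-recorded sentences first, so repeats only start once every sentence has at least one clip. The race conditions described above could still produce occasional duplicates.

```python
# Hypothetical "least-recorded first" sentence selection -- an illustration of
# the policy discussed above, not Common Voice's real code.
import random
from collections import defaultdict

clip_counts = defaultdict(int)  # sentence text -> number of clips recorded so far

def next_batch(sentences: list[str], batch_size: int = 5) -> list[str]:
    """Pick a batch to record, preferring sentences with the fewest clips."""
    shuffled = random.sample(sentences, len(sentences))  # randomize ties
    return sorted(shuffled, key=lambda s: clip_counts[s])[:batch_size]

def clip_saved(sentence: str) -> None:
    """Call when a new recording for `sentence` has been saved."""
    clip_counts[sentence] += 1
```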
nukeador (Rubén Martín):
I’ve edited the first post to prevent any notion that duplicate recordings are “not useful”; that’s not at all the case.
Deep Speech is still in large part a research project. The team is constantly learning and optimizing what the “gold standard” training dataset looks like, both for our own engine and to cater to the needs of the broader research community.
The more clarity we gain, the better we can design and further develop data collection via Common Voice.
To be very clear: all recorded and validated hours are valuable and will be included in the Common Voice dataset. We just want to incorporate feedback we’ve gotten, and have thus been putting even more emphasis on sentence diversity and volume through tool adjustments, new approaches, and calls to action like the one above.