nukeador (Rubén Martín):
Hi everyone,
I’ve been running some numbers on the 15 locales with the most validated hours today, calculating how many sentences and how many validated hours each of them has.
Based on this, we can estimate how many sentences are needed to cover the current hours without any repetition (one recording per sentence). A small sketch of the calculation follows the table.
Sentences difference: negative numbers indicate that we need at least that many more sentences to cover the current hours (and potentially even more to get additional clips without repetitions).
Additional hrs we could accommodate: negative numbers indicate that we already have that many hours of repeated sentences, which won’t be used for Deep Speech training.
| Locale | Current hours | Current sentences | Sentences to cover current hrs | Sentences difference | Additional hrs we could accommodate |
| --- | ---: | ---: | ---: | ---: | ---: |
| English | 880 | 1392395 | 633600 | 758795 | 1053.88 |
| German | 390 | 1412583 | 280800 | 1131783 | 1571.92 |
| French | 218 | 2130572 | 156960 | 1973612 | 2741.13 |
| Spanish | 44 | 1178931 | 31680 | 1147251 | 1593.40 |
| Chinese (China) | 14 | 53164 | 10080 | 43084 | 59.84 |
| Kabyle | 260 | 35715 | 187200 | -151485 | -210.40 |
| Catalan | 140 | 33622 | 100800 | -67178 | -93.30 |
| Persian | 87 | 6005 | 62640 | -56635 | -78.66 |
| Chinese (Taiwan) | 54 | 4827 | 38880 | -34053 | -47.30 |
| Welsh | 54 | 1470 | 38880 | -37410 | -51.96 |
| Basque | 45 | 6262 | 32400 | -26138 | -36.30 |
| Russian | 38 | 10787 | 27360 | -16573 | -23.02 |
| Italian | 36 | 12283 | 25920 | -13637 | -18.94 |
| Tatar | 28 | 17814 | 20160 | -2346 | -3.26 |
| Dutch | 24 | 5249 | 17280 | -12031 | -16.71 |
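For reference, here is a minimal sketch (in Python) of the arithmetic behind the table. It assumes the ratio the table itself implies, roughly 720 recordings per validated hour (an average clip of about 5 seconds); the two locales in the example are just rows copied from above.

```python
# Sketch of the table's arithmetic, assuming ~720 recordings per validated hour
# (about 5 seconds per clip) -- the ratio implied by the table itself
# (e.g. 880 h -> 633,600 sentences).
SENTENCES_PER_HOUR = 3600 / 5  # = 720

def sentence_balance(current_hours, current_sentences):
    """Return (sentences needed, sentence difference, additional hours)."""
    needed = current_hours * SENTENCES_PER_HOUR
    difference = current_sentences - needed
    additional_hours = difference / SENTENCES_PER_HOUR
    return needed, difference, additional_hours

for locale, hours, sentences in [("English", 880, 1_392_395),
                                 ("Catalan", 140, 33_622)]:
    needed, diff, extra = sentence_balance(hours, sentences)
    print(f"{locale}: needs {needed:,.0f} sentences, "
          f"difference {diff:,.0f}, additional hours {extra:,.2f}")
```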
If you are in a locale with a negative difference, I would encourage you to prioritize mobilizing your communities and networks to get more sentences, either through the Sentence Collector or by getting some technical people to review our recently published script to mass-extract sentences from Wikipedia.
We need to avoid people recording the same sentences over and over again.
Thanks everyone for your support!
Update: Edited to prevent any notion that duplicate recordings are “not useful”; that’s not at all the case.
Deep Speech is still in large part a research project. The team is constantly learning and optimizing what the “gold standard” training dataset looks like, both for our own engine and to cater to the needs of the broader research community.
The more clarity we gain, the better we can design and further develop data collection via Common Voice.
To be very clear: all recorded and validated hours are valuable and will be included in the Common Voice dataset. We just want to incorporate feedback we’ve gotten, and have thus been putting even more emphasis on sentence diversity and volume through tool adjustments, new approaches, and calls to action like the one above.
It’s worth mentioning that a large proportion of the 880 hours of English consists of duplicate sentences recorded prior to the Sentence Collector / Wikipedia import, so English can probably accommodate a lot more hours than calculated.
nukeador (Rubén Martín):
You are right, and I’ve asked for this to be checked as well, but it’s not as urgent as for other languages, since we already have enough sentences in English to keep accommodating recordings.
I’m part of the Catalan community, and we would have loved to know (earlier) that, out of the 140 hours we’ve recorded, 93 hours can’t be used because they are duplicates… So now we know we need more sentences.
But why does Common Voice propose sentences that already have recordings? Does that also happen even when “non-used” sentences exist?
Meaning: will English have the same problem, where some of the current 880 hours may not be useful because they are duplicates?
nukeador (Rubén Martín):
We haven’t put a hard lock on recordings on the site, because some communities asked to be able to keep collecting voices for other uses (not Deep Speech), and it’s only recently (late last year) that we got more clarity on the number of recordings per sentence needed to train models.
That’s why we have always been pushing for communities to get more sentences (the Sentence Collector and the Wikipedia extraction work). Recordings of the same sentence are not lost (they are part of the dataset); it’s just that they are not as useful as recordings of new sentences.
We are going to be working with the Deep Speech team to fully understand this and propose the changes needed to support their goals while balancing that with what communities are asking for.
What do you mean by “non-used” here? For the languages with a deficit, all sentences have already been recorded at least once.
By “non-used” I meant sentences without recordings.
Rephrasing my question: is it possible that in languages like English some sentences have more than one recording while others have zero?
Oh, absolutely. This is definitely the case with English. Common Voice operated with a limited text corpus for over a year before Sentence Collector came along. So there’s a high chance that anything prior to the beginning of 2019 (when SC launched) has a significant number of duplicates.
The latest dataset offers 770 hours of English but once you filter out the duplicates you end up with a tiny fraction of that. I can’t remember the exact amount but I think it was in the ballpark of 130 hrs or 17%. That’s a pretty bad ratio, but there’s since been another 100 hrs or so of Sentence Collector / wiki sentences recorded, which should be reasonably low on duplicates, so that percentage should improve in the next dataset release.
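If anyone wants to check this for their own locale, a rough, unofficial sketch is below. It assumes the release’s validated.tsv layout (tab-separated, with a sentence column, as in the Common Voice downloads) and, since clip durations aren’t listed in the TSV, approximates them with an assumed 5-second average.

```python
# Rough, unofficial estimate of how much of a locale's validated data is
# first-time sentences vs. repeats. Assumes validated.tsv has a "sentence"
# column and that clips average ~5 seconds (both assumptions, not guarantees).
import pandas as pd

AVG_CLIP_SECONDS = 5  # assumed average clip duration

df = pd.read_csv("en/validated.tsv", sep="\t")

total_clips = len(df)
unique_sentences = df["sentence"].nunique()
duplicate_clips = total_clips - unique_sentences

total_hours = total_clips * AVG_CLIP_SECONDS / 3600
unique_hours = unique_sentences * AVG_CLIP_SECONDS / 3600

print(f"total:      {total_clips} clips (~{total_hours:.0f} h)")
print(f"unique:     {unique_sentences} sentences (~{unique_hours:.0f} h)")
print(f"duplicates: {duplicate_clips} clips "
      f"({duplicate_clips / total_clips:.0%} of all clips)")
```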
That was my point here: shouldn’t the Common Voice web app make it zero instead of just low?
IMHO, if Deep Speech can only use one recording per sentence, that’s what Common Voice should try to get, and it should only offer an already-recorded sentence once all sentences have been recorded (and validated) at least once.
It should ideally be zero but that’s probably pretty difficult to achieve. There will always be a margin of error due to the potential for race conditions - two people recording the same sentence at the same time, someone recording a sentence after leaving their browser open for ages, etc.
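Purely as an illustration (this is not Common Voice’s actual implementation), the policy being asked for could look roughly like this: always serve the least-recorded sentences first, so repeats only start once every sentence has at least one clip. The race conditions described above could still produce occasional duplicates.

```python
# Hypothetical "least-recorded first" sentence selection -- an illustration of
# the policy discussed above, not Common Voice's real code.
import random
from collections import defaultdict

clip_counts = defaultdict(int)  # sentence text -> number of clips recorded so far

def next_batch(sentences: list[str], batch_size: int = 5) -> list[str]:
    """Pick a batch to record, preferring sentences with the fewest clips."""
    shuffled = random.sample(sentences, len(sentences))  # randomize ties
    return sorted(shuffled, key=lambda s: clip_counts[s])[:batch_size]

def clip_saved(sentence: str) -> None:
    """Call when a new recording for `sentence` has been saved."""
    clip_counts[sentence] += 1
```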
nukeador (Rubén Martín):
I’ve edited the first post to prevent any notion that duplicate recordings are “not useful”; that’s not at all the case.
Deep Speech is still in large part a research project. The team is constantly learning and optimizing what the “gold standard” training dataset looks like, both for our own engine and to cater to the needs of the broader research community.
The more clarity we gain, the better we can design and further develop data collection via Common Voice.
To be very clear: all recorded and validated hours are valuable and will be included in the Common Voice dataset. We just want to incorporate feedback we’ve gotten, and have thus been putting even more emphasis on sentence diversity and volume through tool adjustments, new approaches, and calls to action like the one above.