Hi everyone,
I’ve been running some numbers for the 15 locales with the most validated hours today, calculating how many sentences each one has and how many validated hours.
Based on this we can estimate how many sentences each locale needs to cover its current hours without any repetition (one recording per sentence). The estimates assume an average clip length of 5 seconds, i.e. 720 clips per hour (for example, English: 880 hours × 720 = 633,600 sentences).
Sentences difference: negative numbers indicate that we need at least that many more sentences to cover the current hours (and potentially more, to keep collecting clips without repetitions).
Additional hours we could accommodate: negative numbers indicate we already have that many hours of repeated sentences, which won’t be used for Deep Speech training.
Locale | Current hours | Current sentences | Sentences to cover current hours | Sentences difference | Additional hours we could accommodate
---|---|---|---|---|---
English | 880 | 1,392,395 | 633,600 | 758,795 | 1,053.88
German | 390 | 1,412,583 | 280,800 | 1,131,783 | 1,571.92
French | 218 | 2,130,572 | 156,960 | 1,973,612 | 2,741.13
Spanish | 44 | 1,178,931 | 31,680 | 1,147,251 | 1,593.40
Chinese (China) | 14 | 53,164 | 10,080 | 43,084 | 59.84
Kabyle | 260 | 35,715 | 187,200 | -151,485 | -210.40
Catalan | 140 | 33,622 | 100,800 | -67,178 | -93.30
Persian | 87 | 6,005 | 62,640 | -56,635 | -78.66
Chinese (Taiwan) | 54 | 4,827 | 38,880 | -34,053 | -47.30
Welsh | 54 | 1,470 | 38,880 | -37,410 | -51.96
Basque | 45 | 6,262 | 32,400 | -26,138 | -36.30
Russian | 38 | 10,787 | 27,360 | -16,573 | -23.02
Italian | 36 | 12,283 | 25,920 | -13,637 | -18.94
Tatar | 28 | 17,814 | 20,160 | -2,346 | -3.26
Dutch | 24 | 5,249 | 17,280 | -12,031 | -16.71
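If you want to reproduce or extend these numbers for your own locale, here is a minimal Python sketch of the arithmetic behind the table, assuming the 5-second average clip length mentioned above. The function name and structure are just illustrative; this is not code from the Common Voice repository.

```python
# Estimate sentence needs per locale, assuming an average validated
# clip length of 5 seconds (720 one-recording-per-sentence clips/hour).
AVG_CLIP_SECONDS = 5
CLIPS_PER_HOUR = 3600 // AVG_CLIP_SECONDS  # 720

def estimate(current_hours: int, current_sentences: int):
    """Return (sentences to cover current hours, sentences difference,
    additional hours we could accommodate)."""
    needed = current_hours * CLIPS_PER_HOUR
    difference = current_sentences - needed
    # Positive: extra hours we could record without repeating a sentence.
    # Negative: hours already covered only by repeated sentences.
    additional_hours = difference * AVG_CLIP_SECONDS / 3600
    return needed, difference, round(additional_hours, 2)

# Example: English, 880 validated hours and 1,392,395 sentences.
print(estimate(880, 1_392_395))   # -> (633600, 758795, 1053.88)
# Example: Kabyle, 260 validated hours and 35,715 sentences.
print(estimate(260, 35_715))      # -> (187200, -151485, -210.4)
```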
If you are in a locale with a negative difference, I would encourage you to prioritize mobilizing your communities and networks to collect more sentences, either through the Sentence Collector or by getting some technical people to review our recently published script to mass-extract sentences from Wikipedia.
We need to avoid having people record the same sentences over and over again.
Thanks everyone for your support!
Update: I’ve edited this post to prevent any notion that duplicate recordings are “not useful”; that’s not at all the case.
Deep Speech is still, in large part, a research project. The team is constantly learning and optimizing what the “gold standard” training dataset looks like, both for our own engine and to cater to the needs of the broader research community.
The more clarity we gain, the better we can design and further develop data collection via Common Voice.
To be very clear: all recorded and validated hours are valuable and will be included in the Common Voice dataset. We just want to incorporate the feedback we’ve received, and thus have been putting even more emphasis on sentence diversity and volume through tool adjustments, new approaches, and calls to action like the one above.
We keep learning.