English has reached its goal of 1200 validated hours. What will be the next goal?

stergro · February 12, 2020, 2:06pm

English is the first language that has reached the goal of 1200 validated hours. Congrats, that’s a great accomplishment! Maybe even a reason to party a little?

On the language page the diagram looks like this:

It looks like the diagram is stuck like this for now. What will be the next goal? I think the best would be to always jump 1000 hours up to the next round number, so the next goal could be 2 000 hours. This would keep the diagram closer to the other languages than, say, 10 000 hours. It would also be great if you could see the exact numbers when you hover over the numbers like on the diagram on the main page.

What are your thoughts?

mbranson · February 12, 2020, 5:14pm

Brilliant milestone indeed – congrats everyone! I’d say 2500 hours makes sense incrementally. Takes us to 2x of current progress and well on the way to a healthy baseline dataset size for a single language. @rosana I’m curious if you have input as to what our next dataset size goal should be?

xorgy · February 23, 2020, 11:50pm

Aside from size goals, it would be nice to try setting quality goals, in my opinion.

mbranson · February 24, 2020, 8:39pm

@xorgy agreed, and 2020 is indeed the year we’re focusing on the quality of the dataset, not only its size. We’re working to understand what ‘quality’ means and will then determine the criteria to aim for. I’d expect you all to see some communication about that once we have our quality strategy proposal in place. For now I’ll recommend that we set 2500 as the next hours collection goal and re-evaluate our overall quantity to quality ratio once we have quality criteria in place.

stergro · February 24, 2020, 9:49pm

Sounds good. A nice first step could be that reported sentences actually get filtered out instead of just being added to a list of reports. If you are afraid of abuse, one could filter out sentences after they have been reported two or three times. But it makes more sense to filter reported sentences out first and then put wrongly reported ones back into the database manually than the other way around. I’ve heard quite often that it is very demotivating for people if they report a sentence and it shows up again and again.
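A minimal sketch of the "filter after two or three reports" idea, in Python. This is not how Common Voice handles reports; the function name `split_by_reports`, the `REPORT_THRESHOLD` constant, and the flat list of reported sentences are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical threshold: hide a sentence once it has this many reports.
REPORT_THRESHOLD = 2


def split_by_reports(sentences, reports, threshold=REPORT_THRESHOLD):
    """Split sentences into (active, held_for_review).

    `reports` is a list with one entry per report, each entry being the
    reported sentence. Sentences at or above the threshold are held back
    so a human can restore them if the reports were wrong.
    """
    counts = Counter(reports)
    active = [s for s in sentences if counts[s] < threshold]
    held = [s for s in sentences if counts[s] >= threshold]
    return active, held


sentences = ["What is the time?", "Teh cat sat.", "Good morning."]
reports = ["Teh cat sat.", "Teh cat sat.", "Good morning."]
active, held = split_by_reports(sentences, reports)
```

Here a sentence with a single report stays in rotation, which limits the impact of one abusive reporter, while a sentence with two reports is held until someone reviews it.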

dabinat · February 24, 2020, 10:29pm

I’ve spotted a new trend recently: multiple people deliberately reading sentences incorrectly. (For example, if the sentence is “what is the time?” they may say something like “what is the goose?”)

If people are trying to taint the results at the recording end, then people are probably also doing so at the validation end.

I feel like a weighting system may boost quality by weighting people with no account or few approved recordings / validations lower and weighting more experienced users higher.

So for example, if the threshold is two positive votes and you weigh new users at 0.75 and experienced users at 1.25, three new users are required to validate a clip, or one new user and one experienced user, or two experienced users.
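The arithmetic in this example can be checked with a short Python sketch. The weights (0.75 and 1.25) and the threshold of two come from the post above; the `Vote` class, the `experienced` flag, and the function names are hypothetical, and nothing here reflects how Common Voice actually counts votes.

```python
from dataclasses import dataclass

NEW_USER_WEIGHT = 0.75          # weight from the example above
EXPERIENCED_USER_WEIGHT = 1.25  # weight from the example above
APPROVAL_THRESHOLD = 2.0        # "two positive votes"


@dataclass(frozen=True)
class Vote:
    experienced: bool  # hypothetical flag: does the voter have a track record?


def weighted_score(votes):
    """Sum the weights of all positive votes on a clip."""
    return sum(
        EXPERIENCED_USER_WEIGHT if v.experienced else NEW_USER_WEIGHT
        for v in votes
    )


def is_validated(votes, threshold=APPROVAL_THRESHOLD):
    """A clip is validated once the weighted score reaches the threshold."""
    return weighted_score(votes) >= threshold


new, exp = Vote(experienced=False), Vote(experienced=True)
two_new = is_validated([new, new])           # 1.5, not enough
three_new = is_validated([new, new, new])    # 2.25
mixed = is_validated([new, exp])             # 2.0
two_exp = is_validated([exp, exp])           # 2.5
```

With these numbers, two new users alone fall short of the threshold, which is what forces the third vote.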

stergro · February 25, 2020, 9:32am

I’ve seen that too and it is very annoying. But I think that this kind of misread sentence can be easily filtered out once you have a first neural network. Such clips will have a much lower detection rate.

davidak · March 3, 2020, 8:10pm

I read the initial question also as: What can we do with the dataset since we reached the goal? Is it actually usable to train DeepSpeech? How good is the accuracy?

stergro · March 5, 2020, 1:04pm

Good point. It is now certainly good enough to launch a first beta of Firefox Voice in English.

From my understanding, 1200 hours is the minimum for a somewhat usable system, and you can expect a good system with 10 000 hours. So 1200 hours is the number for an MVP.

But I believe these numbers will vary a bit from language to language.
