English has reached its goal of 1200 validated hours. What will be the next goal?

English is the first language that has reached the goal of 1200 validated hours. Congrats, that’s a great accomplishment! Maybe even a reason to party a little?

On the language page the diagram looks like this:

It looks like the diagram is stuck like this for now. What will be the next goal? I think it would be best to always jump 1000 hours up to the next round number, so the next goal could be 2 000 hours. That would keep the diagram closer to the other languages than, say, 10 000 hours would. It would also be great to see the exact numbers when hovering over the diagram, like on the main page.

What are your thoughts?


Brilliant milestone indeed – congrats everyone! I’d say 2500 hours makes sense incrementally: it takes us to roughly 2× the current progress and well on the way to a healthy baseline dataset size for a single language. @rosana I’m curious whether you have input on what our next dataset size goal should be?


Aside from size goals, it would be nice to try setting quality goals, in my opinion.


@xorgy agreed, and 2020 is indeed the year we’re focusing on the quality of the dataset, not only its size. We’re working to understand what ‘quality’ means and will then determine the criteria to aim for. Expect to see some communication about that once our quality strategy proposal is in place. For now I recommend we set 2500 as the next hours collection goal and re-evaluate our overall quantity-to-quality ratio once quality criteria are in place.


Sounds good :slight_smile: A nice first step would be to actually filter out reported sentences instead of just keeping a list of the reports. To guard against abuse, a sentence could be filtered out only after it has been reported two or three times. It also makes more sense to filter reports out first and manually put wrongly reported sentences back into the database than the other way around. I’ve heard quite often that it is very demotivating when a sentence someone reported keeps showing up again and again.
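The report-threshold idea could be sketched roughly like this. Everything here is hypothetical: `REPORT_THRESHOLD`, the data shapes, and `filter_reported` are illustrative assumptions, not part of the actual Common Voice codebase.

```python
from collections import Counter

# Assumed threshold: a sentence is hidden once it has this many reports.
REPORT_THRESHOLD = 2

def filter_reported(sentences, reports):
    """Keep only sentences with fewer than REPORT_THRESHOLD reports.

    sentences: dict mapping sentence_id -> text
    reports:   list of sentence_ids, one entry per report filed
    """
    counts = Counter(reports)
    return {sid: text for sid, text in sentences.items()
            if counts[sid] < REPORT_THRESHOLD}

sentences = {1: "What is the time?", 2: "Helo wrld", 3: "Good morning"}
reports = [2, 2, 3]  # sentence 2 reported twice, sentence 3 once
remaining = filter_reported(sentences, reports)
print(remaining)  # sentence 2 is filtered out; sentence 3 survives one report
```

Wrongly filtered sentences would then be restored by a manual review pass, which matches the "filter first, restore later" order suggested above.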

I’ve spotted a new trend recently: multiple people deliberately reading sentences incorrectly. (For example, if the sentence is “what is the time?” they may say something like “what is the goose?”)

If people are trying to taint the results at the recording end, they are probably also doing so at the validation end.

I feel like a weighting system could boost quality: weight people with no account or few approved recordings/validations lower, and more experienced users higher.

So, for example, if the threshold is two positive votes and you weight new users at 0.75 and experienced users at 1.25, then validating a clip requires three new users, or one new user and one experienced user, or two experienced users.
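A minimal sketch of that weighted-vote scheme, using the example numbers from the post. The weights, the threshold, and the 50-validation cutoff for "experienced" are assumptions for illustration only, not an existing Common Voice mechanism.

```python
# Assumed weights and threshold, taken from the example in the post.
NEW_USER_WEIGHT = 0.75
EXPERIENCED_WEIGHT = 1.25
THRESHOLD = 2.0

def weight(approved_validations: int) -> float:
    """New or anonymous contributors count less than experienced ones.

    The 50-validation cutoff is an arbitrary assumption.
    """
    return EXPERIENCED_WEIGHT if approved_validations >= 50 else NEW_USER_WEIGHT

def is_validated(voter_experience: list) -> bool:
    """voter_experience: approved validation count of each positive voter."""
    return sum(weight(v) for v in voter_experience) >= THRESHOLD

print(is_validated([0, 0]))      # two new users: 1.5 < 2.0 -> False
print(is_validated([0, 0, 0]))   # three new users: 2.25 -> True
print(is_validated([0, 100]))    # one new + one experienced: 2.0 -> True
print(is_validated([100, 100]))  # two experienced: 2.5 -> True
```

Note that the sum for one new plus one experienced user lands exactly on the threshold (0.75 + 1.25 = 2.0), so the comparison must be `>=` rather than `>` for the example in the post to hold.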


I’ve seen that too, and it is very annoying. But I think these kinds of clips can easily be filtered out once a first neural network has been trained: they will have a much lower recognition rate.
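The model-based filter could look something like the sketch below. A real pipeline would compare the ASR model's transcript (or its confidence) against the prompt; here `word_overlap` is a deliberately crude toy stand-in for that comparison, and the 0.9 cutoff is an arbitrary assumption.

```python
def word_overlap(prompt: str, transcript: str) -> float:
    """Toy agreement score: fraction of prompt words found in the transcript.

    A real system would use ASR confidence or word error rate instead.
    """
    p = prompt.lower().rstrip("?.!").split()
    t = set(transcript.lower().rstrip("?.!").split())
    return sum(1 for w in p if w in t) / max(len(p), 1)

MIN_OVERLAP = 0.9  # assumed cutoff; any substituted word fails a short prompt

# (prompt, what a first ASR model heard) -- hypothetical pairs
clips = [
    ("What is the time?", "What is the time?"),   # read correctly
    ("What is the time?", "What is the goose?"),  # deliberately misread
]
kept = [prompt for prompt, heard in clips
        if word_overlap(prompt, heard) >= MIN_OVERLAP]
print(kept)  # only the correctly read clip survives
```

The misread example from above scores 3/4 = 0.75 and is dropped, while the faithful reading scores 1.0 and is kept.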

I also read the initial question as: what can we do with the dataset now that we have reached the goal? Is it actually usable to train DeepSpeech? How good is the accuracy?


Good point. It is now certainly good enough to launch a first beta of Firefox Voice in English.

From my understanding, 1200 hours is roughly the minimum for a somewhat usable system, and you can expect a good system with 10 000 hours. So 1200 hours is the number for an MVP.

But I believe these numbers will vary a bit from language to language.