I’ll quote your message there for reference
Recently a change was made to the site to list sentences with the fewest recordings first in order to add more unique sentences for the DeepSpeech model. I think that this was a good idea overall; however, I’m starting to see something that could be a problem.
Some users are recording a LOT of sentences. In fact, over the past few days I have validated around 1500-2000 clips and I would estimate around 70% of them were recorded by the same user, all of which were unique sentences.
I’m sure that the DeepSpeech team makes certain that there aren’t too many recordings by a single user, so these sentences will most likely be discarded until there are more recordings available. But if the site shows sentences with the fewest recordings first, it will have to go through the thousands of unrecorded sentences to get to that point again, which may never happen if more sentences keep getting added.
The DeepSpeech team said they don’t want more than a few hundred recordings from any one user. So a user with 5000 recordings may have prevented 4700 sentences from making it into the model.
So I think the solution to this is either:
Put a hard limit on the total number of recordings users can make or have a daily per-user limit.
Change the algorithm so that each sentence has, say, 3 recordings minimum before it’s given a lower priority in the queue.
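To make the two options concrete, here is a rough Python sketch of how they could work together. It is not the site's actual queue code: the function name `next_sentence`, the data shapes, the cap of 300 (standing in for "a few hundred"), and the choice to serve nearly complete sentences first within the top tier are all assumptions for illustration.

```python
from collections import Counter

MIN_RECORDINGS = 3   # assumed threshold, taken from the "say, 3" in option 2
PER_USER_CAP = 300   # assumed stand-in for "a few hundred" recordings per user


def priority(count, min_recordings=MIN_RECORDINGS):
    """Lower tuples are served first."""
    if count < min_recordings:
        # Top tier: still below the minimum. Within this tier, sentences
        # closest to the minimum come first (an assumption of this sketch).
        return (0, -count)
    # Lower tier: already at the minimum; fewest recordings first.
    return (1, count)


def next_sentence(sentences, recordings, user,
                  cap=PER_USER_CAP, min_recordings=MIN_RECORDINGS):
    """Pick the next sentence id to show `user`, or None if nothing is eligible.

    sentences:  list of sentence ids
    recordings: list of (user, sentence_id) pairs already recorded
    """
    per_user = Counter(u for u, _ in recordings)
    if per_user[user] >= cap:
        return None  # option 1: hard per-user limit reached

    per_sentence = Counter(s for _, s in recordings)
    already_done = {s for u, s in recordings if u == user}
    candidates = [s for s in sentences if s not in already_done]
    if not candidates:
        return None

    # option 2: rank by tier, breaking ties by id so the result is deterministic
    return min(candidates,
               key=lambda s: (priority(per_sentence[s], min_recordings), s))
```

With this ordering, a sentence recorded once by a very active user stays near the front of the queue for everyone else until it reaches the minimum, instead of being buried behind thousands of unrecorded sentences.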
In the coming weeks we will be working on a few experiments involving personal goals and will also invite more people to contribute since, as we have commented in the past, diversity is super important.