Is there an upper limit for one person?

I have been told that one person shouldn’t make a lot of recordings because that might bias the data.

For the Georgian project, there aren’t that many volunteers. I’m at the top of the leaderboard with around 800 recordings, and second place has around 400. The community hasn’t been active lately, and I don’t expect that to change any time soon.

My question is: should I continue making recordings, or should I stop for the reason mentioned above?

Your time would be better spent asking other people to record than recording yourself. At this point (and these numbers are approximate), one new person making 5 recordings is probably worth as much as you making 100 more. There is no upper limit, but more voices are definitely better than many recordings from a single user.


There is no limit. If you check the top list on the dashboard, you can see that the top 10 donors have each donated over 45 000 sentences.

I believe that for languages with small datasets, a single active contributor can make a big difference, because this person ensures that a useful dataset and a first model can be created at all. But this model will only be useful for this one person, of course. You have to work on generalization later, but it can be a good proof of concept.

So, once your dataset is big enough to really create models, it is much better to focus on diversity. But in my experience, having active donors and a steadily growing dataset is also very good for the morale of the other donors and keeps the community running. Even motivated people quickly leave the project if there is nothing to validate.
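To get a rough feel for how lopsided a dataset is, you can count recordings per speaker and look at how much of the data comes from the top contributor. The snippet below is only a sketch: the function name and the example data are made up for illustration (the 800 and 400 are the leaderboard numbers from this thread, the ten smaller contributors are an assumption), and the "effective number of speakers" is the inverse Simpson index, a common diversity measure that I am using here as a stand-in, not a metric the project itself reports.

```python
from collections import Counter


def speaker_concentration(speaker_ids):
    """Return (top_share, effective_speakers) for a list with one speaker ID per recording."""
    counts = Counter(speaker_ids)
    total = sum(counts.values())
    shares = [n / total for n in counts.values()]
    # Fraction of all recordings made by the single most active speaker.
    top_share = max(shares)
    # Inverse Simpson index: equals the number of speakers when everyone
    # contributes equally, and shrinks toward 1 as one speaker dominates.
    effective_speakers = 1.0 / sum(s * s for s in shares)
    return top_share, effective_speakers


# Hypothetical data: 800 and 400 recordings from the two top contributors,
# plus ten assumed contributors with 20 recordings each.
recordings = ["a"] * 800 + ["b"] * 400 + [f"s{i}" for i in range(10) for _ in range(20)]
top_share, effective = speaker_concentration(recordings)
print(f"top speaker share: {top_share:.2f}, effective speakers: {effective:.2f}")
```

With twelve speakers in this made-up example, the effective number of speakers still comes out below three, which is one way to see why a handful of recordings from new voices adds more diversity than many more recordings from the same person.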

Another thing you can do instead of recording is to expand the sentence corpus. This is especially important for small languages.


Can you point me to some recent research/paper which has quantitative analysis?

Thanks for the feedback. I’d better work on getting more people involved in the cause.


@Razmik-Badalyan, in case you haven’t seen it yet, there is a global campaign for this coming up. Perhaps you would like to join it for this purpose… More info here:


I don’t know of one, but it would be an interesting experiment. You can look at some of the numbers in the technical report I wrote. In particular, Table 1 gives the number of hours of training data and the number of speakers, and Table 2 gives some baseline results. In general, you can see that the models with few speakers in the training data perform poorly. I also have anecdotal data from other experiments, but no hard numbers.