Are 10 000 hours of recordings necessary for every language?

For some smaller languages, the 10 000 hour goal of the Common Voice project is very ambitious. How usable are smaller datasets for machine learning? And how much does the number of irregularities in a language matter in this context?

Let’s take an extreme example and look at the Esperanto dataset with its 20 h. The language is completely regular, has no exceptions, and its pronunciation is always unambiguous. How big must a dataset be to produce useful results in this case? And apart from constructed languages: are there differences between natural languages, or do they all need the same amount of data?

Edit: 10 000 hours instead of 100 000, and mentioned the Common Voice project.

Pretty sure that @ftyers knows that very well :slight_smile:

Where did you get 100 000 hours from? We train our English release models with ~3 000 hours of data.

Interesting. And do you think these 3 000 hours are enough?

Sorry, I was wrong: it says 10 000 hours. When you work on the Common Voice website, it always says something like “help us get a dataset of 10 000 hours for language X.” And it is also in the FAQ:

Why is 10,000 validated hours the per language goal for capturing audio?

This is approximately the number of hours required to train a production speech-to-text system.

Edit: mentioned Common Voice.

That’s Common Voice, and yes, the value mostly comes from targeting English. Each language has its own specifics. I don’t know Esperanto well enough to give a ballpark figure; Francis might be able to.

There are many factors to take into account, among which:

  • Transparency of orthography – Esperanto orthography is transparent
  • Diversity of speakers – Esperanto speakers have quite diverse accents (as there are nearly no L1 speakers, you get transfer from the L1)
  • Number of speakers – If you only have 10 speakers of a language and have recordings of all of them, your system is going to work better than if you have 10 million speakers and only recordings of 100.
  • Noise levels – How clean are the recordings, and how clean are the environments where you want to use the ASR system? The noisier, the more data you need.

There is no research to my knowledge on comparing these factors to be able to tell for a new language exactly how much data is necessary.
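To get a feel for the speaker-diversity factor in your own data, you can count clips per speaker in the dataset metadata. A sketch below, using a small in-memory sample of (speaker ID, clip path) pairs rather than a real Common Voice TSV (in the real metadata the speaker column is `client_id`, and you would read the rows with `csv.DictReader`):

```python
from collections import Counter

# Hypothetical metadata rows: (client_id, clip_path) pairs,
# standing in for the rows of a Common Voice validated.tsv.
rows = [
    ("spk_a", "clip_001.mp3"),
    ("spk_a", "clip_002.mp3"),
    ("spk_a", "clip_003.mp3"),
    ("spk_b", "clip_004.mp3"),
    ("spk_c", "clip_005.mp3"),
]

# Count how many clips each speaker contributed.
clips_per_speaker = Counter(client_id for client_id, _ in rows)

n_speakers = len(clips_per_speaker)
n_clips = sum(clips_per_speaker.values())
top_speaker, top_count = clips_per_speaker.most_common(1)[0]

print(f"{n_speakers} speakers, {n_clips} clips")
print(f"most prolific speaker: {top_speaker} with {top_count} clips "
      f"({top_count / n_clips:.0%} of the data)")
```

If one speaker dominates the dataset, accuracy on new voices will likely be worse than the raw hour count suggests.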

The best thing to do is “try it and see”, and report your results. You can also try transfer learning, using the transfer-learning2 branch of DeepSpeech, or a hybrid system like Kaldi, which can give better results in low-data scenarios.

Hope this helps!

Thanks for the answer, this sounds promising; I would like to experiment with this. Do you have a good starting point to learn about DeepSpeech apart from the GitHub repository? Is this something I can achieve on a consumer laptop running an ordinary Linux, or do I need more computing power?

@ftyers, could you kindly put this in other words? Did you mean the fewer the speakers, the higher the accuracy?

For accuracy on those speakers, yes. For general accuracy (on any random speaker), the more speakers the better, obviously. But if a language only has 10 speakers and you have recordings of all 10 of them, then obviously the results are going to be better.

@ftyers
If I need to train a model to be speaker-independent and age-independent, what would you suggest taking into account?

I have a large dataset with thousands of different speakers. However, some speakers have hundreds or thousands of recordings, while many others have only one recording. How should I deal with this?

Regarding age, I have many more recordings of adults than of children. Should I filter out some of the adults’ data, for example?
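To make the question concrete, what I had in mind was something like capping the number of clips per speaker before training (a sketch with made-up speaker IDs, not my real data; the same grouping idea would work for age labels):

```python
import random
from collections import defaultdict

def cap_per_speaker(clips, max_per_speaker, seed=0):
    """Keep at most `max_per_speaker` randomly chosen clips per speaker.

    `clips` is a list of (speaker_id, clip_path) pairs. Grouping on an
    age label instead would balance age groups the same way.
    """
    by_speaker = defaultdict(list)
    for speaker_id, path in clips:
        by_speaker[speaker_id].append((speaker_id, path))

    rng = random.Random(seed)  # fixed seed for reproducible sampling
    capped = []
    for speaker_clips in by_speaker.values():
        if len(speaker_clips) > max_per_speaker:
            speaker_clips = rng.sample(speaker_clips, max_per_speaker)
        capped.extend(speaker_clips)
    return capped

# Made-up example: one over-represented speaker, two with a single clip each.
clips = [("spk_a", f"a_{i}.mp3") for i in range(100)]
clips += [("spk_b", "b_0.mp3"), ("spk_c", "c_0.mp3")]

balanced = cap_per_speaker(clips, max_per_speaker=10)
print(len(balanced))  # 12: 10 from spk_a, 1 each from spk_b and spk_c
```

Would capping like this throw away too much useful data, or is it a reasonable starting point?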

The core of my question is: is the main challenge for a speech-recognition neural network handling all the exceptions and irregularities of a language, or handling all the different voices and accents? Which is the harder problem, requiring more data? I can guess that, since no one has ever tried machine learning with a completely regular constructed language, there is no answer to this question yet. But I would guess that regularity could simplify things drastically.

About the other things @ftyers mentioned:

Diversity of speakers – Esperanto speakers have quite diverse accents (as there are nearly no L1 speakers, you get transfer from the L1)

I would say there are quite a few L1 speakers, around one thousand people. But what makes Esperanto unusual is that, in my experience, native Esperanto speakers sometimes tend to have stronger national accents than someone who has learned Esperanto as a second language. I guess this is because there are no Esperanto schools: they only use Esperanto within their family for most of the year and only meet the Esperanto community at congresses. This leads to a situation where the country of origin matters more for accent than whether someone is a native or an L2 speaker. I know complete beginners in Esperanto who pronounce every word perfectly and could contribute to this dataset much better than some native speakers.

Number of speakers – If you only have 10 speakers of a language and have recordings of all of them, your system is going to work better than if you have 10 million speakers and only recordings of 100.

The commitment to contributing to projects is much bigger in the Esperanto community than in other groups; one can clearly see this in things like the Esperanto Wikipedia. Languages with comparable numbers of speakers have far fewer articles. I guess that if we advertise Common Voice more in Esperanto magazines, at congresses, and on websites, we could get pretty good coverage; 1 000 of the 2 million L2 speakers seems possible to me. Right now there are contributions from over 140 people in Esperanto, which is already quite impressive for a constructed language.