Esperanto-TTS

I’m now looking forward to an authentic Esperanto TTS for building my own educational systems. Is there any group of people trying to build an Esperanto TTS engine out of this data set?

How much computing power is normally required to build a TTS? I’m thinking of a personal level investment of a GPU equipped PC within 20000 USD budget range. A single training session should not exceed two weeks, so that I can try multiple times within 6 months. Is it possible to build a decent TTS using this level of computing power?

#tirifto #Esperanto

Hey, AFAIK there is no group of people working on this right now. I do experiment with it a little. There is a whole section of this forum about TTS, and I already started a thread there about an Esperanto TTS: Esperanto TTS

The other thread is a little old. In the meantime I asked the mimic2 TTS project whether they support the Common Voice dataset, and there is a way to use this data: https://github.com/MycroftAI/mimic2/issues/44

But remember: right now we only have around 30 hours of recordings by 200 donors in the Esperanto dataset. For a good TTS you need around 15 hours of good-quality recordings from a single speaker. It might be a better approach to use free Esperanto audiobooks from LibriVox as input data. That means some transcription work, though.
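To put those numbers side by side: 30 hours spread over 200 donors is about 9 minutes per donor on average, far below the roughly 15 hours wanted from one voice. To see how the recordings are really distributed, you could count clips per speaker. The following is only a rough sketch: it assumes the dataset's clip listing is a tab-separated file with a `client_id` column identifying the speaker (which is how I remember the Common Voice releases), and the average clip length of 5 seconds is a made-up placeholder that you should replace with a measured value. The function names are my own.

```python
import csv
from collections import Counter


def count_clips_per_speaker(lines, speaker_column="client_id"):
    """Count clips per speaker.

    `lines` is any iterable of TSV lines including the header row,
    for example an open file object. The column name is an assumption.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        counts[row[speaker_column]] += 1
    return counts


def top_speakers(counts, n=5, avg_clip_seconds=5.0):
    """Return (speaker, clips, estimated_hours) for the n largest speakers.

    The hours are a crude estimate: clips * assumed average clip length.
    """
    return [
        (speaker, clips, clips * avg_clip_seconds / 3600.0)
        for speaker, clips in counts.most_common(n)
    ]
```

You would call it with something like `count_clips_per_speaker(open("validated.tsv", encoding="utf-8", newline=""))`, where the file name is again an assumption about the dataset layout. If even the largest speaker is far from 15 hours, that supports going the audiobook route instead.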

Besides all this, I am still planning to train a model with the CV dataset just to see the results. Maybe a completely regular language needs much less training data than a natural one; no one has ever tested this.

I’m thinking of a personal level investment of a GPU equipped PC within 20000 USD budget range.

I assume you mean 2000 USD? This sounds like a good plan. You basically need a lot of RAM and a good GPU. Just search for “gpu for machine learning” and you will find a lot of information. You can also always rent a cloud server to train your model.

A single training session should not exceed two weeks, so that I can try multiple times within 6 months.

AFAIK the training will not take more than a week, maybe even less. Right now, with the small dataset, I guess it won’t even take a day.

Moving to TTS category.