Esperanto TTS

Hey everyone,
looks like Esperanto has come a long way: we already have 20 hours of audio in the dataset, recorded by over 140 contributors. I am very impressed by your work and am trying to add some value to the dataset myself. I also talk about this project on r/Esperanto, Mastodon, and Duolingo to bring more people in as contributors.

I am looking forward to seeing the first experiments with this dataset, because afaik no one has ever tried machine learning with constructed languages. I would guess that the regularity and the lack of exceptions lead to very good results, even with small sample sizes.

I would love to experiment with text-to-speech in Esperanto. If this leads to good results, it could become a breakthrough for audiobooks in Esperanto. Does the Mozilla TTS engine run on a consumer laptop without a fast GPU?

PS: some parts of the Common Voice website are not translated into Esperanto yet. I tried to help a little, but unfortunately I am only an intermediate Esperanto speaker and don’t want to translate longer texts. Does anybody want to help me translate the missing parts?

BTW: the article about Common Voice in the Esperanto Wikipedia could use some work. I will try to expand it this weekend.

@stergro Mozilla TTS can be run for inference on a laptop without a GPU (the CPU determines how long it takes to generate the speech, but it’s fairly near real-time on my XPS 13).

One thing to note is that this data would come from multiple speakers, and multi-speaker handling in TTS is a work in progress right now (previously the speech data came from single speakers). Someone asked about a similar scenario in one of the issues a couple of months back:


@nmstoker Okay, I hope multi-speaker setups become possible in the future. The dataset has pretty big contributions from single donors. Would it make sense to train the TTS on one of the donors who donated more than 1000 audio files? Or would this be too little to get good results?

I think it’s a little early to tell the best approach, and others are probably better placed to comment (eg @erogol ), but with the current single-speaker approach I’m pretty sure 1k recordings would be far too little (assuming 5-10 seconds each). I got vaguely intelligible results with around 5hrs but it only really got decent when my sample set got well beyond 10hrs.
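As a rough back-of-the-envelope check of those numbers, here is a minimal sketch that converts a clip count into hours of audio. The 5-10 second clip length is the assumption stated above, not a measured value, and `total_hours` is just an illustrative helper name.

```python
def total_hours(num_clips: int, seconds_per_clip: float) -> float:
    """Convert a number of clips of a given average length into hours."""
    return num_clips * seconds_per_clip / 3600


# Assumed range from the post above: 1,000 clips of 5 to 10 seconds each.
low = total_hours(1000, 5)
high = total_hours(1000, 10)

print(f"1,000 clips: roughly {low:.1f} to {high:.1f} hours of audio")
```

Under that assumption, 1,000 recordings come to well under 3 hours, which is below the roughly 5 hours where results started to become intelligible.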

Okay, good to know. Maybe creating a good TTS voice in Esperanto has to be a project of its own then (with a professional, selected speaker). I looked a little closer at the dashboard: some speakers donated almost 2,000 recordings, so maybe there will be a possibility in the future.
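To check which single speakers have enough clips without relying on the dashboard, something like the following sketch could work. It assumes the dataset download contains a tab-separated file with a `client_id` column identifying each speaker (as in the `validated.tsv` files I have seen in Common Voice releases); the file layout, the column names, and the helper names are assumptions, and the sample data is made up purely for illustration.

```python
import csv
import io
from collections import Counter


def clips_per_speaker(tsv_file):
    """Count rows per speaker in a tab-separated clip listing.

    Assumes a header row with a 'client_id' column (an assumption about
    the dataset layout, not something confirmed in this thread).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["client_id"] for row in reader)


def speakers_with_at_least(counts, min_clips):
    """Return (speaker, clip_count) pairs with at least min_clips, largest first."""
    return [(s, n) for s, n in counts.most_common() if n >= min_clips]


# Made-up in-memory sample standing in for the real file.
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "aaa\tclip1.mp3\tSaluton\n"
    "aaa\tclip2.mp3\tBonan tagon\n"
    "bbb\tclip3.mp3\tDankon\n"
    "aaa\tclip4.mp3\tĜis revido\n"
)

counts = clips_per_speaker(sample)
top = speakers_with_at_least(counts, 2)
print(top)
```

With the real file you would open it with `open(path, encoding="utf-8", newline="")` instead of the in-memory sample, and use a threshold such as 1000.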

I still do believe that a completely regular language drastically decreases the number of necessary samples, but I guess I will have to try it out to really know how much this is the case.

I got vaguely intelligible results with around 5hrs but it only really got decent when my sample set got well beyond 10hrs.

You tried this with English, right @nmstoker ?


Yes, I tried with English (with a collection of samples I’d produced / recorded of my own voice).

