Hey, AFAIK there is no group of people working on this right now. I do experiment with it a little. There is a whole section in this forum about TTS, and I already started a thread there about an Esperanto TTS: Esperanto TTS
That other thread is a little old. In the meantime I asked the mimic2 TTS project whether they support the Common Voice dataset, and there is a way to use this data: https://github.com/MycroftAI/mimic2/issues/44
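If anyone wants to try that route: Tacotron-style trainers like mimic2 usually want an LJSpeech-style layout (a wavs/ folder plus a metadata.csv with "id|transcript" lines), so the Common Voice export has to be converted first. Here is a minimal sketch of that conversion; it assumes the standard Common Voice export (a clips/ folder with mp3s and a validated.tsv with "path" and "sentence" columns), ffmpeg on the PATH, and a made-up folder name for the Esperanto download, so double-check the details against your own data and against the mimic2 preprocessing docs.

```python
# Rough sketch: convert a Common Voice language export into an
# LJSpeech-style folder (wavs/ + metadata.csv) that Tacotron-like
# trainers such as mimic2 can usually be pointed at.
# Assumes: validated.tsv with "path" and "sentence" columns, mp3 clips
# in clips/, and ffmpeg available on the PATH.
import csv
import subprocess
from pathlib import Path

CV_DIR = Path("cv-corpus/eo")        # hypothetical path to the Esperanto export
OUT_DIR = Path("eo_ljspeech")
(OUT_DIR / "wavs").mkdir(parents=True, exist_ok=True)

rows = []
with open(CV_DIR / "validated.tsv", newline="", encoding="utf-8") as f:
    for rec in csv.DictReader(f, delimiter="\t"):
        clip_id = Path(rec["path"]).stem
        wav_path = OUT_DIR / "wavs" / f"{clip_id}.wav"
        # Resample to 22.05 kHz mono, the rate most Tacotron recipes expect.
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-y",
             "-i", str(CV_DIR / "clips" / rec["path"]),
             "-ar", "22050", "-ac", "1", str(wav_path)],
            check=True,
        )
        rows.append(f"{clip_id}|{rec['sentence']}|{rec['sentence']}")

with open(OUT_DIR / "metadata.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(rows))
```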
But remember: right now we only have around 30 hours of recordings by 200 donors in the Esperanto dataset. For a good TTS you need around 15 hours from a single speaker in good quality. Maybe it would be a better approach to use free Esperanto audiobooks from LibriVox as input data. That means some transcription work, though.
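If you go with the Common Voice data anyway, it is worth checking how much audio the single most active donor contributed, since one consistent voice matters more than total hours. A quick sketch, again assuming the standard validated.tsv layout with a "client_id" column and only a rough per-clip duration estimate:

```python
# Quick look at how the Esperanto recordings are spread across donors.
# Assumes the standard Common Voice validated.tsv with a "client_id" column.
import csv
from collections import Counter

counts = Counter()
with open("cv-corpus/eo/validated.tsv", newline="", encoding="utf-8") as f:
    for rec in csv.DictReader(f, delimiter="\t"):
        counts[rec["client_id"]] += 1

for speaker, n in counts.most_common(5):
    # Very rough estimate: Common Voice clips average around 4-5 seconds.
    print(f"{speaker[:12]}…  {n:5d} clips  ≈ {n * 4.5 / 3600:.1f} h")
```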
But besides all this I am still planning to train a model with the CV dataset, just to see the results. Maybe a completely regular language needs much less training data than a natural one; no one has ever tested this.
I’m thinking of a personal-level investment in a GPU-equipped PC within a 20000 USD budget range.
I assume you mean 2000 USD? This sounds like a good plan. You basically need a lot of RAM and a good GPU. Just search for “gpu for machine learning” and you will find a lot of information. But you can also always rent a cloud server to train your model.
A single training session should not exceed two weeks, so that I can try multiple times within 6 months.
AFAIK the training will not take more than a week, maybe even less. Right now, with the small dataset, I guess it won’t even take a day.