If you don’t mind, some comments on this… Here is your v15.0 distribution of validated recordings per voice:
- Less than 5% of people recorded more than 128 sentences.
- Most of them recorded just 5 sentences and left (they tried!).
- Your average is ~63 sentences, but that is pulled up by the top performers; if you leave them out, it will drop.
This will always be similar; check other languages. These few recordings are also valuable, because they usually go to the test & dev splits, while the top performers’ voices go to the train split.
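If you want to reproduce these numbers for any release yourself, here is a minimal sketch. It assumes the release's `validated.tsv` is tab-separated and has a `client_id` column identifying each voice; the function names and the 128 threshold parameter are just my choices for illustration.

```python
import csv
from collections import Counter
from statistics import mean, median


def load_client_ids(tsv_path):
    """Yield the client_id of every row in a tab-separated file.

    Assumes a header row containing a 'client_id' column.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row["client_id"]


def recordings_per_voice_summary(client_ids, heavy_threshold=128):
    """Summarize how many recordings each voice contributed."""
    counts = Counter(client_ids)
    if not counts:
        raise ValueError("no recordings given")
    per_voice = list(counts.values())
    heavy = sum(1 for c in per_voice if c > heavy_threshold)
    return {
        "voices": len(per_voice),
        "mean": mean(per_voice),
        "median": median(per_voice),
        "share_above_threshold": heavy / len(per_voice),
    }
```

Something like `recordings_per_voice_summary(load_client_ids("validated.tsv"))` would then show the gap between mean and median, which is the top-performer effect mentioned above.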
10 000 hours is
Don’t aim for the 10k: we have fine-tuning and transfer learning now, and 10k is an old figure. I’d say go stepwise: 100 - 200 - 500 - 1000 - 2000 hours. You might want to set yearly targets this way, because far-away targets scare people. You are past 100h, so pick 200/250/300 for 2024, for example.
Your avg. recording duration is about 5 sec, so for 1000h (say by the end of 2025) you would need ~650k new recordings, ~200k new sentences, and 2-3k new voices (300 recordings each on average). Don’t forget, it will again follow the same skewed distribution as above…
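For anyone who wants to redo this estimate with other targets, here is a rough sketch of the arithmetic. It assumes you currently stand at exactly 100 validated hours (you are somewhat past that, so the real numbers are a bit lower); the function names are only for illustration.

```python
import math


def recordings_needed(target_hours, current_hours, avg_clip_seconds):
    """Number of new recordings needed to get from current to target hours."""
    remaining_seconds = (target_hours - current_hours) * 3600
    return math.ceil(remaining_seconds / avg_clip_seconds)


def voices_needed(recordings, avg_recordings_per_voice):
    """Number of new voices needed if each contributes the given average."""
    return math.ceil(recordings / avg_recordings_per_voice)


# Assumed starting point: 100 h validated, 5 s average clip.
new_recordings = recordings_needed(1000, 100, 5)   # 648000, i.e. ~650k
new_voices = voices_needed(new_recordings, 300)    # 2160, i.e. 2-3k
```

Longer clips shrink the recording count proportionally: at a 10 sec average, the same 900 hours would need half as many recordings.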
I’d advise longer sentences, though. Try to get some longer recordings; SotA models work best with 5-25 sec recordings.
My bet is on influencers.
We were not successful with this, but AI was not a thing at that time… I hope you’ll get better results.
Georgian language day
That worked well for us; try to start the campaign teasers 1 month before.
As far as I know, TTS training requires studio-quality clean recording
That was in the past. As @kathyreid emphasized, there is VALL-E now. This is a pre-trained base model, and with even a small amount of data (3 seconds!!!) you can adapt it to other voices. Higher-quality and longer recordings will be better of course, but clean CV recordings (no crackling, background sounds, etc.) can easily be used for it.
Because of TTS-based scams, they removed high-quality models from public access, but they are there and people do replicate the science.