I’m keen to discuss what people have been considering regarding data and training approaches to improve voice quality (naturalness of audio) and overall capabilities.
I’ve read the wiki Dataset page and played around with the notebooks, and they were helpful. I also realise a big improvement comes from increasing the size of my dataset (quality got radically better when I went from 6 hrs to 12–13 hrs), and I’m pushing on to increase that further, but I also wanted to think about how to direct my efforts best.
Phoneme coverage, as mentioned on the wiki, seems critical, so I’ve started gathering stats to show how well (or poorly!) my dataset represents general English speech. I’m also looking at how well the eSpeak backend converts the words in my dataset to phonemes, since if it produces phonemes that are wrong or markedly different from my dataset’s actual pronunciation, it’ll undermine the model’s ability to learn well.
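For what it’s worth, here’s a minimal sketch of the kind of coverage stats I mean. The `(word, phoneme-string)` pairs are made up for illustration — in practice they’d come from running each transcript through the eSpeak backend — so treat the phoneme strings as placeholders, not real eSpeak output.

```python
from collections import Counter

# Hypothetical (word, phoneme-string) pairs; in practice these would come
# from phonemizing the dataset transcripts with the eSpeak backend.
phonemized = [
    ("hello", "h ə l oʊ"),
    ("world", "w ɜː l d"),
    ("speech", "s p iː tʃ"),
]

def phoneme_frequencies(entries):
    """Return each phoneme's share of all phoneme tokens in the dataset."""
    counts = Counter()
    for _word, phonemes in entries:
        counts.update(phonemes.split())
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

freqs = phoneme_frequencies(phonemized)

# Phonemes absent from the dataset (relative to some reference inventory)
# are the coverage gaps worth targeting with new recordings.
missing = {"θ", "ð"} - set(freqs)
```

Comparing `freqs` against the phoneme distribution of general English text would then show which sounds are over- or under-represented.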
One area I’m particularly keen to hear others’ thoughts on is whether there’s any advantage to the following:
- Initially training with a much simpler subset of my data
- Then fine-tuning with a broader set
- Or whether it’s best just to train on everything from the start.
My (naïve) intuition here is that babies start with simple words and build up. I could probably limit training sentences to those under a certain character length, or better still to single short words (although my dataset is a little skewed there, as I don’t really have many single-word sentences). Has anyone tried something similar, or seen any commentary on this kind of thing elsewhere?
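The simple-first filtering I have in mind could be as basic as the sketch below. The `id|text`-style rows and the threshold are assumptions for illustration, not a fixed convention:

```python
# Hypothetical metadata rows in an "id|text" style; the column layout and
# the character threshold are assumptions, chosen just for illustration.
rows = [
    ("utt1", "Yes."),
    ("utt2", "The quick brown fox jumps over the lazy dog."),
    ("utt3", "Hello there."),
]

MAX_CHARS = 20  # arbitrary cut-off defining the "simple" first-stage subset

# Stage 1: train on short utterances only.
simple_subset = [(uid, text) for uid, text in rows if len(text) <= MAX_CHARS]

# Stage 2: fine-tune on the full, broader set.
full_set = rows
```

The open question is whether training on `simple_subset` first actually helps versus training on `full_set` from the start.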