Just thought I’d share some simple observations — would enjoy hearing from others to see if what I am seeing makes sense.
These are some of the better samples from a TT1 model (no batch norm, no forward attention, and using phonemes) trained on a random 500-speaker subset of LibriTTS:
soundcloud link
Some observations:
-TT1 > TT2 for any configuration I try: convergence is much quicker and quality is clearly better.
-Some speakers sound pretty good, some sound really “borgy”, and more males sound borgy than females (I set mel_fmin to 50.0, and I’m not sure how best to set it for a mixed-sex dataset). I’m still not totally sure what contributes the most to “borginess”.
-I used to wonder if Griffin-Lim was my problem, but some of the samples sound reasonably good. I’m sure they could be improved with better vocoding, but I don’t think a better vocoder can ‘rescue’ a model that sounds worse than the samples I currently have.
Have people been able to do much better with multi-speaker? I believe the average amount of audio per speaker in LibriTTS is fairly short (~21 min IIRC), and I haven’t been able to get high quality across all speakers. I’ve tried with TT2 and the universal WaveRNN vocoder without any luck so far (I’m currently trying to train my own WaveRNN model).