I see a few differences in our configs.
Let me just list out what I would do:
- Train taco2 long enough that it generates decent mel specs for both test and train inputs, i.e., make sure your taco2 is trained until it's a decent TTS with Griffin-Lim. I don't mean the quality of the voice, but how easily words are discernible, intonation, etc. (for clarity/quality, we'll train wavernn). There's a Griffin-Lim sanity-check sketch after this list.
- Get erogol's fork, the latest branch (I haven't tried fatchord's, so I can't comment on it). There are a few parameters missing from your config file, without which it should throw errors. I'll have to wait till Monday to give you the exact params, but you shouldn't have a problem adding them in while debugging the errors. Add those in and you should be good. (The config-diff sketch after this list is one way to find them yourself in the meantime.)
- Also, I trim silences when training taco2, and I also had to trim silences for my wavernn; this was the major issue behind the gibberish output. (See the trimming sketch after this list.) Mrgloom on the GitHub issues page (wavernn + tacotron2 experiments) had something to say about this, and he should be right with his premise that the mel specs aren't guaranteed to be aligned with the audio (because of trimming).
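As a rough way to run that Griffin-Lim check, here's a minimal sketch assuming you dump a predicted mel to .npy and have librosa installed. The filenames and audio params are placeholders (use your taco2 config's values), and it assumes a plain power-scale mel; if your model outputs normalized log-mels, undo that first.

```python
import numpy as np
import librosa
import soundfile as sf

# Hypothetical dump of a predicted mel from your taco2 eval, shape (n_mels, frames).
mel = np.load("generated_mel.npy")

# Griffin-Lim inversion of the mel spec. sr/n_fft/hop_length are
# assumptions; they must match whatever your taco2 config uses.
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, n_iter=60
)
sf.write("gl_check.wav", wav, 22050)
```

If the words are discernible in gl_check.wav, taco2 is in decent shape.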
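Until I can send the exact missing params, a quick way to spot them yourself is to diff your config against the default one shipped in the fork. A minimal sketch with placeholder filenames (it only compares top-level keys):

```python
import json

with open("my_config.json") as f:            # your current config
    mine = json.load(f)
with open("fork_default_config.json") as f:  # default config from erogol's fork
    ref = json.load(f)

print("missing keys:", sorted(set(ref) - set(mine)))
print("extra keys:  ", sorted(set(mine) - set(ref)))
```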
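On the trimming point, the idea is to trim once, up front, and compute both the taco2 mels and the wavernn training audio from the same trimmed wav, so the mels and the audio can't drift out of alignment. A minimal sketch assuming librosa-based preprocessing; top_db is an assumption you'd tune per dataset:

```python
import librosa
import soundfile as sf

wav, sr = librosa.load("sample.wav", sr=22050)

# Strip leading/trailing silence. top_db=40 is a guess; raise it if it
# clips speech, lower it if silence survives.
trimmed, _ = librosa.effects.trim(wav, top_db=40)

sf.write("sample_trimmed.wav", trimmed, sr)
# Feed sample_trimmed.wav to BOTH the taco2 and wavernn pipelines.
```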
This is all I have for you right now. I’ll keep adding as and when I remember.
Good luck!
Also, when training from scratch, there is some semblance of human speech within 7k steps, albeit with a lot of noise; that should tell you fairly early on whether something is fundamentally wrong (YMMV on this).
Just FYI, I'm getting high-quality results after 130k steps. Top notch!