I have managed to train a model with 13 hrs of annotated data. The alignment is great and the words in the generated test sentences are easily discernible. The only issue is that the generated audio is not 100% human-like (there is a hint of consistent robotic choppiness in it). I was wondering whether I should fix that with post-processing, or whether I could handle it with hyperparameter tuning?
Regarding your WaveRNN repo, in the ipynb for extracting mel specs: in "generate model outputs" (line 37), should that be inside the if condition for Tacotron? Because if it's not, the list mel_specs stays empty and you cannot stack it. Also, the comment for dry_run seems to say the opposite of what it should, if I'm not wrong. Did I miss something, or do these need to be corrected?
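To make the mel_specs issue concrete, here is a minimal stand-alone reproduction of the failure I mean (the names, shapes, and loop are placeholders, not the notebook's actual code):

```python
import numpy as np

mel_specs = []
model_name = "WaveRNN"  # assumed: anything other than "Tacotron"

for _ in range(3):  # stand-in for the extraction loop over the dataset
    postnet_output = np.zeros((80, 100))  # stand-in for a generated mel spec
    if model_name == "Tacotron":
        mel_specs.append(postnet_output)  # the append only happens inside this branch

# the later cell that stacks the list then fails:
mel_specs = np.stack(mel_specs)  # ValueError: need at least one array to stack
```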
Have you tried out WaveRNN? How long does it take to train to 600k iterations? I am training on one Titan V with 32 workers and it's giving me around 0.3 steps/sec, which would take over 20 days. Is that right?
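For reference, this is the back-of-envelope arithmetic behind the "over 20 days" figure (the numbers are just what I'm currently seeing):

```python
steps_needed = 600_000      # target iterations
steps_per_sec = 0.3         # throughput on one Titan V with 32 workers
days = steps_needed / steps_per_sec / 86_400  # 86,400 seconds per day
print(f"~{days:.1f} days")  # ~23.1 days
```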
Using the default params and 30 h of clean speech from a single speaker. I'm not sure whether training it longer will succeed; I don't see any improvement over the steps.
About that, try uninstalling torch and using 0.4.1; I sort of remember getting the same speed with it.
Thank you for the detailed report. I tried with 6 to 8 workers and that slowed my training down to ~0.1 steps/sec. From your logs, you were getting 2.2 steps/sec (granted, the datasets are different; my longest sequence is 420 long, so at best I should have slowed down only ~3.6x, since the default longest sequence is 150 long, instead of the ~8x slowdown I'm seeing). Regardless, I'll keep training it to see how things go.
What language is this? Is the output as expected or is it just gibberish?
I’ll try this out next.
Also, if possible, could you share your config file for WaveRNN?
The default value for batch_size is 32; I get 2.2 steps/s using 32. fatchord mentioned that no one has tried, or at least shared results with, values other than 32, so keep in mind that it may not work due to the large difference from the tested values.
30k iterations in and the val loss is around 3.9, but all I hear is loud static. Did you have similar outputs? Because your log at 400k shows a loss of around 4. My inference script might be flawed; I'll share it tomorrow. Please check it out and share insights, if possible.
I trained it with a batch size of 16 and r=2. When I dropped it to r=1, I started getting OOM, at which point I trained with a batch size of 8. Like I said, the output of the TTS was great, just not very human-like. Words, pronunciations and intonations were spot on at 150k steps. [on 13 hrs of data]
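A rough sketch of the memory reasoning behind halving the batch size when I moved to r=1 (the frame count is made up, just to show the proportions):

```python
# r is the decoder reduction factor: it predicts r mel frames per decoder step,
# so halving r roughly doubles the decoder steps (and activation memory) per utterance.
frames = 800  # assumed mel-frame length of a long utterance
for r, batch_size in [(2, 16), (1, 8)]:
    decoder_steps = frames // r
    print(f"r={r}: {decoder_steps} decoder steps/utterance, "
          f"batch {batch_size} -> ~{decoder_steps * batch_size} steps per forward pass")
```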
It's consistently static, across different models, steps and settings. There is a very good chance the inference script is bad. static.zip (560.9 KB)