Query regarding post processing


I have managed to train a model with 13 hours of annotated data. The alignment is great and the words in the generated test sentences are easily discernible. The only issue is that the generated audio is not 100% human-like (there is a hint of consistent robotic choppiness in it). Should I fix that with post-processing, or could I handle it with hyperparameter tuning?

With WaveRNN or just TTS?

Without WaveRNN. Just Tacotron2 and Griffin-Lim. I am trying out Tacotron2 + erogol’s implementation of WaveRNN right now.

How much of a difference does WaveRNN make?

WaveRNN using MOL sounds good to me.

I’ll try it out and post the results!

Make sure you disable forward attention in training. That might give you smoother results. Then you can enable it only at inference.
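A minimal sketch of that train/inference split, assuming the model is driven by a config dictionary with a boolean `use_forward_attn` key (the key name and the other values here are assumptions for illustration; check your own config.json):

```python
import copy

# Hypothetical training config; only the forward-attention flag matters here.
train_config = {"use_forward_attn": False, "r": 2}


def make_inference_config(config):
    """Return a copy of the config with forward attention switched on."""
    inference_config = copy.deepcopy(config)
    inference_config["use_forward_attn"] = True
    return inference_config


inference_config = make_inference_config(train_config)
print(train_config["use_forward_attn"], inference_config["use_forward_attn"])
```

Copying the config keeps the training settings untouched, so the same checkpoint can be loaded with either variant.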

Sounds good. I’ll try that out.


Regarding your WaveRNN repo: in the ipynb for extracting mel specs, under "generate model outputs" (line 37), should that be inside the if condition for Tacotron? Because if it’s not, the list mel_specs is empty and you cannot stack it. Also, the comment for dry_run is the opposite of what it should be, if I am not wrong. Did I miss something, or do these need to be corrected?
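To illustrate the stacking problem, here is a small sketch, not the notebook's actual code. The helper name `stack_mel_specs` and the array shapes are made up for the example; the only point is that `np.stack` fails on an empty list, so a guard makes the cause obvious:

```python
import numpy as np


def stack_mel_specs(mel_specs):
    """Stack equally shaped mel spectrograms, failing loudly on an empty list."""
    if len(mel_specs) == 0:
        # np.stack([]) would also raise ValueError, but with a less helpful message.
        raise ValueError("mel_specs is empty; were the model outputs ever appended?")
    return np.stack(mel_specs)


# Two dummy spectrograms: 80 mel bins, 5 frames each (illustrative shapes).
specs = [np.zeros((80, 5)), np.ones((80, 5))]
print(stack_mel_specs(specs).shape)
```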

Also, great job setting up wavernn! Thank you.

Hey carl!

Have you tried out WaveRNN? How long does it take to train to 600k iterations? I am training on one Titan V with 32 workers and it’s giving me around 0.3 steps/sec, which would take over 20 days. Is that right?
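For reference, the estimate is plain arithmetic on the numbers above (600k steps at 0.3 steps/sec); the function name is just for illustration:

```python
def training_days(total_steps, steps_per_sec):
    """Estimate wall-clock training time in days at a constant step rate."""
    seconds = total_steps / steps_per_sec
    return seconds / 86400  # 86400 seconds per day


days = training_days(600_000, 0.3)
print(round(days, 1))  # 23.1
```

This assumes the step rate stays constant for the whole run, which is only a rough approximation.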

Hello @alchemi5t

I’m on it too, no luck yet, attaching an example.

I’m still getting familiar with its behavior, so I can’t yet tell whether things will work. I’m fine-tuning from the mold checkpoint and testing configs.

I think that is too many workers; I use 8 or 6 for a V100.

Here’s the log for 500k from the 400k checkpoint:

log-wavernn.zip (70,8 KB)

tuxvocoder.zip (84,7 KB)

Using the default params and 30h of clean speech from a single speaker. I’m not sure whether more training will help; I don’t see any improvement over the steps.

About that, try uninstalling torch and installing 0.4.1; I sort of remember getting the same speed.

Thank you for the detailed report. I tried with 6 to 8 workers and that slowed my training down to ~0.1 steps/sec. From your logs, you were getting 2.2 steps/sec. Granted, the datasets are different: my longest sequence is 420 long, so at best I should have slowed down only ~3.6 times (since the default longest sequence is 150 long), instead of the ~8 times slowdown I am seeing. Regardless, I’ll keep training to see how things go.

What language is this? Is the output as expected or is it just gibberish?

I’ll try this out next.

Also, if possible, could you share your config file for WaveRNN?



Here: https://drive.google.com/drive/folders/1wpPn3a0KQc6EYtKL0qOi4NqEmhML71Ve

Just changed my paths and mel_fmin to 50.

I got a huge boost from 0.3 steps/sec to ~0.9 steps/sec. Great catch!

Just FYI,

"batch_size": 64,
"num_workers": 32,

The default value for batch_size is 32; I get 2.2 steps/sec using 32. fatchord mentioned that no one has tried, or at least shared results with, values other than 32, so keep in mind that it may not work given the large difference from the tested values.
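Since steps/sec depends on batch size, a rough way to compare the two runs is samples per second. This is only a sketch using the figures quoted in this thread; the GPUs and datasets differ, so the comparison is not apples to apples:

```python
def samples_per_sec(batch_size, steps_per_sec):
    """Throughput in training samples per second."""
    return batch_size * steps_per_sec


run_batch_64 = samples_per_sec(64, 0.9)  # batch_size 64 at ~0.9 steps/sec
run_batch_32 = samples_per_sec(32, 2.2)  # batch_size 32 at 2.2 steps/sec
print(round(run_batch_64, 1), round(run_batch_32, 1))  # 57.6 70.4
```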

I’ll start a parallel run with 32 batch_size and see if that helps. I’ll report results for both.

Awesome! Thanks, really helpful for people with limited compute power.

30k iterations in and the val loss is around 3.9, but all I hear is loud static. Did you have similar outputs? I ask because your log at 400k shows a loss around 4. My inference script might be flawed. I’ll share it tomorrow; please check it out and share insights, if possible.

From checkpoint or from scratch?

Loud static noise, no. Just weird speech, like a whisper but fast. I’m currently fine-tuning taco2 on larger sentences; I’ll share results.

From scratch.

Mine is taco2 on fairly large sentences (the largest is 420 characters long).

It needs more training. There was an issue where someone mentioned that it starts to generate some sort of speech at around 200k. (Of course, this is dataset related.)

How did you fit 420 characters? I can’t even fit 290 with 16GB; in the last steps, memory use jumps from 6GB to OOM :confused:

That’s good news for me.

I trained it with a batch size of 16 and r=2. When I got to r=1, I started getting OOM, at which point I trained with a batch size of 8. Like I said, the output of the TTS was great, just not very human-like. Words, pronunciations and intonations were spot on at 150k steps (on 13 hours of data).

It’s consistently static across different models, steps and settings. There is a very good chance the inference script is bad.
static.zip (560.9 KB)