Query regarding post processing

alchemi5t · August 19, 2019, 8:15am

Hello,

I have managed to train a model with 13 hours of annotated data. The alignment is great and the words in the generated test sentences are easily discernible. The only issue is that the generated audio is not 100% human-like (there is a hint of consistent robotic choppiness in it). Should I fix that with post-processing, or could I handle it with hyperparameter tuning?

carlfm01 · August 19, 2019, 4:11pm

With WaveRNN or just TTS?

alchemi5t · August 19, 2019, 4:35pm

Without WaveRNN. Just Tacotron2 and Griffin-Lim. I am trying out Tacotron2 + erogol’s implementation of WaveRNN right now.

How much of a difference does WaveRNN make?

carlfm01 · August 19, 2019, 4:48pm

https://soundcloud.com/user-565970875/ljspeech-logistic-wavernn
WaveRNN using MOL sounds good to me.

alchemi5t · August 19, 2019, 4:56pm

I’ll try it out and post the results!

erogol · August 20, 2019, 10:50am

Make sure you disable forward attention in training. That might give you smoother results. Then you can enable it only at inference.
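
As a rough sketch of that suggestion, the snippet below derives separate training and inference configs from one base config. The key name `use_forward_attn` is assumed here to be the forward-attention switch in the TTS `config.json`, so check your own config for the exact name. The helper `with_forward_attention` and the base values are hypothetical.

```python
import copy

# Hypothetical base config; only the forward-attention key matters here.
# "use_forward_attn" is an assumed key name, verify it against your config.json.
base_config = {"r": 2, "batch_size": 16, "use_forward_attn": True}


def with_forward_attention(config, enabled):
    """Return a copy of config with forward attention switched on or off."""
    updated = copy.deepcopy(config)
    updated["use_forward_attn"] = enabled
    return updated


train_config = with_forward_attention(base_config, False)  # off while training
infer_config = with_forward_attention(base_config, True)   # on at inference only
```

Keeping both variants derived from one base avoids the two configs drifting apart in any other setting.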

alchemi5t · August 20, 2019, 11:58am

Sounds good. I’ll try that out.

@erogol

Regarding your WaveRNN repo: in the ipynb for extracting mel specs, under "generate model outputs" (line 37), should that be inside the if condition for Tacotron? Because if it’s not, the list mel_specs is empty and you cannot stack it. Also, the comment for dry_run is the opposite of what it should be, if I am not wrong. Did I miss something, or do these need to be corrected?

Also, great job setting up WaveRNN! Thank you.

alchemi5t · August 21, 2019, 4:46am

Hey carl!

Have you tried out WaveRNN? How long does it take to train to 600k iterations? I am training on one Titan V with 32 workers and it’s giving me around 0.3 steps/sec, which would take over 20 days. Is that right?
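
The 20-day estimate can be checked with a few lines of arithmetic. This is only a sketch that assumes the step rate stays constant for the whole run; the function name `training_days` is illustrative.

```python
def training_days(total_steps, steps_per_sec):
    """Wall-clock days needed to reach total_steps at a constant step rate."""
    seconds = total_steps / steps_per_sec
    return seconds / (60 * 60 * 24)


days = training_days(600_000, 0.3)
print(f"{days:.1f} days")  # prints "23.1 days"
```

At 0.3 steps/sec, 600k iterations is a little over 23 days, so "over 20 days" is the right order of magnitude.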

carlfm01 · August 21, 2019, 5:00am

Hello @alchemi5t

I’m on it too, no luck yet, attaching an example.

I’m still getting familiar with its behavior so I can tell what will work. I’m fine-tuning from the mold checkpoint and testing configs.

I think 32 workers is too many; I use 6 or 8 for a V100.

Here’s the log for 500k from the 400k checkpoint:

log-wavernn.zip (70,8 KB)

tuxvocoder.zip (84,7 KB)

Using the default params and 30h of clean speech from a single speaker. I’m not sure whether it will succeed if I train it more; I don’t see any improvement over the steps.

About the speed: try uninstalling torch and using 0.4.1. I sort of remember getting the same speed.

alchemi5t · August 21, 2019, 6:36am

Thank you for the detailed report. I tried with 6 to 8 workers and that slowed my training down to ~0.1 steps/sec. From your logs, you were getting 2.2 steps/sec. Granted, the datasets are different: my longest sequence is 420 long and the default longest sequence is 150, so at best I should have slowed down only ~3.6 times, instead of the ~8 times slowdown I see. Regardless, I’ll keep training to see how things go.

What language is this? Is the output as expected or is it just gibberish?

I’ll try this out next.

Also, if possible, could you share your config file for WaveRNN?

carlfm01 · August 21, 2019, 6:53am

Yes

Spanish

Here: mold_ljspeech_best_model - Google Drive

Just changed my paths and mel_fmin to 50.

alchemi5t · August 21, 2019, 8:07am

I got a huge boost from 0.3 steps/sec to ~0.9 steps/sec. Great catch!

Just FYI,

"batch_size": 64,
"num_workers": 32,

carlfm01 · August 21, 2019, 8:12am

The default value for batch_size is 32, and I get 2.2 steps/sec using 32. fatchord mentioned that no one has tried, or at least shared results for, values other than 32. Keep in mind that it may not work, given the huge difference from the tested values.
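
Since the two runs use different batch sizes, steps/sec alone does not compare them directly. A minimal sketch, using only the numbers quoted in this thread, converts both to samples per second. The runs are on different datasets and GPUs (Titan V vs. V100), so this is a rough comparison and not a benchmark; the function name is illustrative.

```python
def samples_per_sec(steps_per_sec, batch_size):
    """Training throughput in samples/sec: each step consumes one batch."""
    return steps_per_sec * batch_size


# (steps/sec, batch_size) as reported in the posts above.
runs = {
    "alchemi5t, batch_size 64": (0.9, 64),
    "carlfm01, batch_size 32": (2.2, 32),
}

throughput = {name: samples_per_sec(*cfg) for name, cfg in runs.items()}
for name, value in throughput.items():
    print(f"{name}: {value:.1f} samples/sec")
```

By this measure the two runs are closer than the raw step rates suggest: roughly 58 versus 70 samples/sec.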

alchemi5t · August 21, 2019, 8:19am

I’ll start a parallel run with 32 batch_size and see if that helps. I’ll report results for both.

carlfm01 · August 21, 2019, 8:21am

Awesome! Thanks, really helpful for people with limited compute power.

alchemi5t · August 21, 2019, 4:05pm

30k iterations in and the val loss is around 3.9, but all I hear is loud static. Did you have similar outputs? I ask because your log at 400k shows a loss around 4. My inference script might be flawed. I’ll share it tomorrow; please check it out and share insights, if possible.

carlfm01 · August 21, 2019, 5:44pm

From checkpoint or from scratch?

Loud static noise, no. Just weird speech, like a whisper but fast. I’m currently fine-tuning taco2 on longer sentences; I’ll share results.

alchemi5t · August 21, 2019, 5:50pm

From scratch.

Mine is taco2 on fairly long sentences (the longest is 420 characters).

carlfm01 · August 21, 2019, 5:56pm

It needs more training. There was an issue where it was mentioned that it starts to generate some sort of speech at around 200k steps (dataset-dependent, of course).

How did you fit 420 characters? I can’t even fit 290 with 16GB; on the last steps memory use rushes from 6GB to OOM.

alchemi5t · August 22, 2019, 3:15am

That’s good news for me.

I trained it with a batch size of 16 and r=2. When I got to r=1, I started getting OOM, at which point I trained with a batch size of 8. Like I said, the output of the TTS was great, just not very human-like. Words, pronunciations and intonations were spot on at 150k steps (on 13 hrs of data).

It’s consistently static, across different models, steps and settings. There’s a very good chance the inference script is bad.
static.zip (560.9 KB)
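
One hedged way to read the OOM at r=1: in Tacotron-style decoders the reduction factor r is the number of mel frames predicted per decoder step, so halving r doubles the number of decoder steps for the same utterance. Assuming, as a rough approximation only, that decoder activation memory scales with batch size times decoder steps, the sketch below shows why moving from r=2 to r=1 at about the same cost means halving the batch size. The 1000-frame utterance length and the function names are illustrative and not taken from the thread.

```python
import math


def decoder_steps(num_mel_frames, r):
    """Decoder steps needed when each step emits r mel frames."""
    return math.ceil(num_mel_frames / r)


def relative_decoder_cost(batch_size, num_mel_frames, r):
    """Crude proxy for decoder memory: batch size times decoder steps."""
    return batch_size * decoder_steps(num_mel_frames, r)


frames = 1000  # hypothetical length of the longest utterance, in mel frames

cost_r2_b16 = relative_decoder_cost(16, frames, r=2)  # the setup that fit
cost_r1_b16 = relative_decoder_cost(16, frames, r=1)  # the setup that hit OOM
cost_r1_b8 = relative_decoder_cost(8, frames, r=1)    # the setup used afterwards

print(cost_r2_b16, cost_r1_b16, cost_r1_b8)
```

Under this proxy, r=1 with a batch size of 8 costs about the same as r=2 with a batch size of 16, which matches the settings reported above.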
