Query regarding post processing

30k iterations in and the val loss is around 3.9, but all I hear is loud static. Did you have similar outputs? I ask because your log at 400k shows a loss around 4. My inference script might be flawed. I’ll share it tomorrow; please check it out and share insights if possible.

From checkpoint or from scratch?

Loud static noise, no. Just weird speech, like a whisper but fast. Currently fine-tuning taco2 on larger sentences; I’ll share results.

From scratch.

Mine is taco2 on fairly large sentences (the largest is 420 characters long).

It needs more training; there was an issue where someone mentioned that around 200k steps it starts to generate something like speech. (Of course, it’s dataset related.)

How did you fit 420 characters? I can’t even fit 290 with 16GB; on the last steps memory jumps from 6GB to OOM :confused:

That’s good news for me.

I trained it with a batch size of 16 and r=2. When I got it to r=1, I started getting OOM, at which point I trained with a batch size of 8. Like I said, the output of the TTS was great, just not very humanlike. Words, pronunciations and intonations were spot on at 150k steps. [on 13 hrs of data]

It’s consistently static, across different models, steps and settings. There’s a very good chance the inference script is bad.
static.zip (560.9 KB)

So, will "use_forward_attn": true cause faster alignment but worse audio quality?

Use the notebook; it’s the best way to validate and avoid mistakes.

Yea, my script was wonky. The notebook gave great results (1 experiment great, 1 gibberish similar to yours). I might know how to fix your gibberish-producing network. Mind sharing your config for both TTS and WaveRNN? Also, I am using Tacotron 2. I’ll train for a few more days and give you concrete evidence.

I’m leaving, I’ll share ASAP tomorrow.

It really depends on the dataset. For a new dataset, I generally train the model without it, then try it on afterwards. You can also just enable it at inference.
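If it helps, here is a minimal sketch of the "enable it at inference" route: keep the flag off in the training config and only flip it when loading the config for synthesis. The file name and the way the dict is handed to the model loader are illustrative, not the repo’s exact API.

```python
import json

# Load the same config that was used for training (path is illustrative).
with open("config.json") as f:
    config = json.load(f)

# Leave this false while training on a new dataset; turn it on for synthesis only.
config["use_forward_attn"] = True

# ...then hand `config` to whatever builds the model for inference
# (e.g. the model-loading cell of the benchmark notebook).
```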


@erogol

Could you check this?

If you mean postnet_outputs = torch.stack(mel_specs), no, it is also necessary for the Tacotron2 model.

But torch 0.4.1 doesn’t allow stacking an empty list. You get a runtime error saying “expected a non-empty list of Tensors.”
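A minimal repro of what I mean, independent of the repo:

```python
import torch

mel_specs = []  # stays empty on the Tacotron2 code path
postnet_outputs = torch.stack(mel_specs)  # RuntimeError on 0.4.1: expects a non-empty list of Tensors
```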

I need to run it to say more, so I’ll do that when I have more time and get back to you.

I’ll start training WaveRNN again with my new taco2 model. I’d really appreciate it if you shared your intuition about the issue, to avoid wasting compute.
Here’s the new config.
configs.zip (3,8 KB)

I see a few differences in our configs.

Let me just list out what I would do:

  1. Train taco2 long enough that it generates decent mel specs (on both test and train inputs), i.e., make sure your taco2 is trained until it’s a decent TTS with Griffin-Lim (judge not the quality of the voice, but how easily words are discernible, the intonation, etc.; clarity of the voice is what we train WaveRNN for).

  2. Get erogol’s fork, the latest branch (I haven’t tried fatchord’s, so I cannot comment on it). There are a few parameters missing in your config file, without which it should throw errors (I’ll have to wait till Monday to give you the exact params, but you shouldn’t have a problem adding those in while debugging the errors). Add those in and you should be good.

  3. Also, I trim silences for training taco2, and I also had to trim silences for my WaveRNN data; this was the major issue behind my gibberish output (see the sketch below). [Mrgloom on the GitHub issues page (WaveRNN + Tacotron2 experiments) had something to say about this. He should be right with his premise that the mel specs aren’t obliged to be aligned with the audio (because of trimming).]
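Rough sketch of what I mean by keeping trimming consistent between the two pipelines (the top_db value and paths are illustrative):

```python
import librosa

# Load and trim once, with one set of parameters.
wav, sr = librosa.load("sample.wav", sr=22050)
wav_trimmed, _ = librosa.effects.trim(wav, top_db=40)

# Use wav_trimmed for BOTH:
#   1) computing the mel spectrogram used to train Tacotron2, and
#   2) the ground-truth audio WaveRNN is trained to reconstruct.
# If one side trims and the other doesn't, the mel specs and the audio
# drift out of alignment and the vocoder learns gibberish.
```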

This is all I have for you right now. I’ll keep adding as and when I remember.

Good luck!

Also, from scratch, there is some semblance of human speech within 7k steps, albeit with a lot of noise, but that should let you know much earlier on whether you’re screwed or not (YMMV on this).

Just FYI, I’m getting high-quality results after 130k. Top notch!

Hello @alchemi5t, thanks for sharing

Yes, I’m using erogol’s fork with the recommended commits.

I’m using DeepSpeech to preprocess and trim silence via the metadata feature, so I disable the trimming.

Now I’m hitting this issue with my new VM, and I don’t remember which PyTorch version bypassed the issue on my previous VM :confused:

Just put that line of code inside the scope of if C.model == "Tacotron":. I have trained 2 WaveRNN models which work just fine with that modification.
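Roughly what I mean, in case it helps (the surrounding variable names are from memory and may not match the script exactly):

```python
# Sketch of the workaround: only the Tacotron (v1) path fills mel_specs with
# postnet (linear) outputs here, so guard the stack call with the model check
# instead of calling it unconditionally.
if C.model == "Tacotron":
    postnet_outputs = torch.stack(mel_specs)
```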

I am not sure what you mean by that. What I understand is that you use ASR to preprocess and trim the files, and the trimmed files are what is referenced as audio in both the TTS+GL and WaveRNN training. If so, that should work fine. It’s when your audio and generated mel specs are not aligned that you start to generate gibberish.


Hello @alchemi5t, I’ve tried your suggestion with other param combinations, but no luck, same output. While I was training erogol’s WaveRNN, I was also training LPCNet with impressive results using the extracted features. The results convinced me to try combining taco2 + LPCNet; here’s my result from the first try (with the same dataset):
lpcnet taco2.zip (3,0 MB)

I still need to figure out where taco2 decreased the volume; a filter somewhere ruined the output:

With tacotron2 features (weird from the middle to the top):

With extracted features:

About speed:
LPCNet is 3x faster than real time with AVX enabled on my very old AMD FX-8350 (only using one core); I haven’t tried with AVX2 enabled yet. The issue here is tacotron2: it is not real time, and as erogol mentioned, tacotron2’s 29M params versus the 7M params that TTS has is not a good trade-off. Mixing TTS and LPCNet may be a good experiment, but to build a client using TTS and LPCNet we need to find a way to mix PyTorch and TensorFlow (which looks hard to achieve to me) or convert the TTS model to a TensorFlow model (I know a couple of tools we can try).
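To make the idea concrete, the kind of file-based handoff I have in mind looks roughly like this. It is only a sketch: the feature file and the synthesis command are placeholders, and LPCNet expects its own feature layout (Bark-scale cepstra plus pitch parameters), not raw mel bins, so a real bridge needs a conversion step that is not shown here.

```python
import subprocess
import numpy as np

# Features produced on the PyTorch Tacotron2 side, already converted to the
# layout LPCNet expects (placeholder file name; the conversion is the hard part).
features = np.load("taco2_features_for_lpcnet.npy").astype(np.float32)

# LPCNet's tools read raw float32 frames from a file.
features.tofile("features.f32")

# Placeholder command: substitute the synthesis entry point of your LPCNet
# build (Python script or compiled binary) to get 16-bit PCM out.
subprocess.run(["lpcnet_synthesis", "features.f32", "out.pcm"], check=True)
```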

I know @reuben is working on a client; maybe share insights? FYI, I managed to compile LPCNet for Windows using Bazel.

I’ll share the trained models tomorrow if someone wants to fine-tune.

That’s interesting. Happy to see new leads here.

I am sorry I couldn’t be of much help here. I’ll keep working on it to see if I find any hints.

This is very promising, I’ll start experimenting on this soon.

Also, have you successfully trained on multiple GPUs? I believe I have a PyTorch issue where it hangs on dist.init_process_group (near line 84 in distribute.py).

I checked p2p communication and I got this:

Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

So I shouldn’t have problems training with 0,1 or 2,3, but the training hangs indefinitely. I can’t train with r=1 because of memory overflow. Any insights?

I’ve seen a lot of similar issues, most still open, probably because it hangs silently (without errors or logs).
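For what it’s worth, the knobs I’d try first are generic CUDA/NCCL ones, nothing specific to distribute.py (sketch; normally you’d export these in the shell before launching):

```python
import os

# Restrict training to a pair that the P2P check above says can talk to each other.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # or "2,3"

# If it still hangs inside init_process_group and the backend is NCCL, force
# NCCL to route traffic through host memory instead of peer-to-peer copies.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Verbose NCCL logging, to confirm whether the hang is inside NCCL at all.
os.environ["NCCL_DEBUG"] = "INFO"
```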