Query regarding post processing

So, "use_forward_attn": true will cause faster alignment but worse audio quality?

Use the notebook; it's the best way to validate and avoid mistakes.

Yeah, my script was wonky. The notebook gave great results (one experiment great, one gibberish similar to yours). I might know how to fix your gibberish-producing network. Mind sharing your configs for both TTS and WaveRNN? Also, I am using Tacotron 2. I'll train for a few more days and give you concrete evidence.

I'm leaving; I'll share ASAP tomorrow.

It really depends on the dataset. For a new dataset, I generally train the model without it, then try it afterwards. You can also just enable it at inference.
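
For example, a quick way to do that (just a sketch; the file names are placeholders) is to copy the config you trained with and flip the flag only for synthesis:

```python
import json

# Placeholder paths: point these at the config shipped with your checkpoint.
with open("config.json") as f:
    config = json.load(f)

# Trained with forward attention off; enable it only for inference.
config["use_forward_attn"] = True

with open("config_infer.json", "w") as f:
    json.dump(config, f, indent=4)
```

Then point the synthesis script at config_infer.json instead of the training config.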


@erogol

Could you check this?

If you mean postnet_outputs = torch.stack(mel_specs), no, it is also necessary for the Tacotron2 model.

But torch 0.4.1 doesn't allow stacking an empty list. You get a runtime error saying "expected a non-empty list of Tensors."

I need to run it to say more, so I'll do that when I have more time and get back to you.

I'll start training WaveRNN again with my new taco2 model. I'd really appreciate it if you shared your intuition about the issue, to avoid wasting compute.
Here's the new config.
configs.zip (3.8 KB)

I see a few differences in our configs.

Let me just list out what I would do:

  1. Train taco2 long enough that it generates decent mel specs (for both test and train inputs), i.e., make sure your taco2 is a decent TTS with Griffin-Lim before moving on. Judge not the quality of the voice, but how easily words are discernible, the intonation, etc.; for voice quality, we'll train WaveRNN.

  2. Get erogol's fork, the latest branch (I haven't tried fatchord's, so I cannot comment on it). There are a few parameters missing in your config file, without which it should throw errors (I'll have to wait till Monday to give you the exact params, but you shouldn't have a problem adding them in while debugging the errors). Add those in and you should be good.

  3. Also, I trim silences for training taco2, and I also had to trim silences for my WaveRNN data; this was the major issue behind my gibberish output. [Mrgloom on the GitHub issues page (WaveRNN + Tacotron2 experiments) had something to say about this. He should be right with his premise that the mel specs aren't obliged to be aligned with the audio (because of trimming).] There's a minimal trimming sketch after this list.
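
Here is roughly what I mean by consistent trimming (a sketch; the top_db threshold is an assumption and must match whatever you used for taco2):

```python
import librosa

TOP_DB = 40  # assumed threshold; use the same value for taco2 and WaveRNN data

def load_trimmed(path, sample_rate=22050):
    """Load a clip and trim leading/trailing silence the same way everywhere,
    so the mel specs and the audio WaveRNN sees stay aligned."""
    wav, sr = librosa.load(path, sr=sample_rate)
    trimmed, _ = librosa.effects.trim(wav, top_db=TOP_DB)
    return trimmed, sr
```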

This is all I have for you right now. I'll keep adding as and when I remember.

Good luck!

Also, when training from scratch, there is some semblance of human speech within 7k steps, albeit with a lot of noise, but that should let you know whether you're screwed or not much earlier on (YMMV on this).

Just FYI, I'm getting high-quality results after 130k steps. Top notch!

Hello @alchemi5t, thanks for sharing

Yes, I'm using erogol's fork with the recommended commits.

I'm using DeepSpeech to preprocess and trim silence via the metadata feature, so I disable the trimming.

Now I'm hitting this issue with my new VM, and I don't remember the PyTorch version that bypassed it on my previous VM :confused:

Just put that line of code into the scope of the "if C.model == "Tacotron":" block. I have trained 2 WaveRNN models which work just fine with that modification.
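
For reference, a stripped-down sketch of that workaround (model_name and mel_specs stand in for the variables in the extraction script):

```python
import torch

mel_specs = []            # stays empty on the Tacotron2 path
model_name = "Tacotron2"  # stand-in for C.model

# Only stack on the Tacotron (Griffin-Lim) path, so Tacotron2 never calls
# torch.stack() on an empty list, which raises
# "expected a non-empty list of Tensors" under torch 0.4.1.
if model_name == "Tacotron":
    postnet_outputs = torch.stack(mel_specs)
```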

I am not sure what you mean by that. What I understand is: you use ASR to preprocess and trim the files, and the trimmed files are referenced as the audio for training both TTS+GL and WaveRNN. If so, that should work fine. It's when your audio and generated mel spec are not aligned that you start to generate gibberish.
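
A quick sanity check for that (a sketch, assuming the usual hop-length relation between samples and frames; HOP_LENGTH is whatever you extracted the mels with):

```python
import numpy as np

HOP_LENGTH = 256  # assumed; must match the mel extraction settings

def is_aligned(wav, mel, tolerance_frames=2):
    """mel is assumed shaped (n_mels, frames); returns True if the audio
    length is consistent with the number of mel frames."""
    expected_frames = int(np.ceil(len(wav) / HOP_LENGTH))
    return abs(expected_frames - mel.shape[1]) <= tolerance_frames
```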


Hello @alchemi5t, I've tried your suggestion with other param combinations, but no luck, same output. While I was training erogol's WaveRNN, I was also training LPCNet, with impressive results using the extracted features. The results convinced me to try to combine taco2 + LPCNet; here's my result from the first try (with the same dataset):
lpcnet taco2.zip (3.0 MB)

I still need to figure out where taco2 decreased the volume; a filter somewhere ruined the output:

With tacotron2 features (weird from the middle to the top):

With extracted features:

About speed:
LPCNet is 3x faster than real time with AVX enabled on my very old AMD FX-8350 (only using one core); I haven't tried with AVX2 enabled yet. The issue here is with tacotron2: it is not real time, and as erogol mentioned, the 29M params of tacotron2 vs. the 7M params that TTS has is no small deal. Mixing TTS and LPCNet may be a good experiment, but to build a client using TTS and LPCNet we need to find a way to mix PyTorch and TensorFlow (which looks hard to achieve to me), or convert the TTS model to a TensorFlow model (I know a couple of tools we can try).
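
For reference, this is how I measure "faster than real time" (just an illustration; synthesize stands in for whatever TTS + vocoder pipeline you run):

```python
import time

def real_time_factor(synthesize, text, sample_rate=16000):
    """synthesize is any callable returning a 1-D array of audio samples."""
    start = time.time()
    audio = synthesize(text)
    elapsed = time.time() - start
    audio_seconds = len(audio) / sample_rate
    return audio_seconds / elapsed  # > 1.0 means faster than real time
```

With this definition, "3x faster than real time" means the factor is about 3.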

I know @reuben is working on a client; maybe share insights? FYI, I managed to compile LPCNet for Windows using Bazel.

I'll share the trained models tomorrow if someone wants to fine-tune.

That's interesting. Happy to see new leads here.

I am sorry I couldn't be of much help here. I'll keep working on it to see if I find any hints.

This is very promising; I'll start experimenting on this soon.

Also, have you successfully trained on multiple GPUs? I believe I have a PyTorch issue where it hangs on dist.init_process_group (near line 84 in distribute.py).

I checked p2p communication and I got this:

Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

So I shouldn't have problems training on GPUs 0,1 or 2,3, but the training hangs indefinitely. I can't train with r=1 because of memory overflow. Any insights?
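
In case it helps, this is how I'd narrow it down (a sketch; can_device_access_peer needs a reasonably recent PyTorch, and the environment variables are standard CUDA/NCCL knobs, nothing specific to this repo):

```python
import torch

# Print the peer-access matrix so it can be compared with the device pairs
# the trainer is actually using.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"Device={i} {'CAN' if ok else 'CANNOT'} access peer Device={j}")
```

Then restrict training to a peer-capable pair, e.g. `CUDA_VISIBLE_DEVICES=0,1 python distribute.py ...`, and if it still hangs inside NCCL, try `NCCL_P2P_DISABLE=1` as a test.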

I've seen a lot of similar issues, most still open, probably because it hangs silently (without errors or logs).

Yes, with 4 K80s and torch 0.4.1 for TTS; for WaveRNN, only 1 GPU.

Try reducing the max length of the sentences.

Or maybe it's related to mixing different GPUs? https://devtalk.nvidia.com/default/topic/1023825/peer2peer-with-2-nvidia-cards-geforce-gtx-1080-ti-and-titan-x/

see last comment

I wanted to train only on the Titan Vs. I fixed it by installing Apex. Did you need Apex?

No, multi-GPU with TTS just worked, no extra tricks.

Here's the LPCNet trained model:


Trained using: https://github.com/MlWoo/LPCNet

For the tacotron2 model, I'm still testing where the filter is; once I get it working properly, I'll share it.

Thank you for sharing this.

I don't understand what you mean by this. What are you trying to solve, and how? (Just curious)

@erogol @carlfm01

I've not managed to train a good model with
"r": 1, // Number of frames to predict for step.

This is for various reasons, memory being the top one. To deal with it, I reduced the batch size to 16 (documented: "Batch size for training. Lower values than 32 might cause hard to learn attention"), which works without any OOM errors, but it takes far too long to train, and after 30k steps (which I know is low) the generated test audio is just empty. Whereas when I switch to r=2, the generated audio has certain speaker qualities by 1.6k steps, and by 20k steps you can easily discern what the TTS is trying to produce.
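
For context on the memory side, my rough mental model (a back-of-the-envelope sketch, not code from the repo): r is the number of mel frames the decoder emits per step, so the number of decoder steps, and with them the attention and hidden states kept around for backprop, shrinks roughly by a factor of r.

```python
import math

def decoder_steps(mel_frames, r):
    # One decoder step emits r mel frames.
    return math.ceil(mel_frames / r)

# An 800-frame utterance:
print(decoder_steps(800, 1))  # 800 steps
print(decoder_steps(800, 2))  # 400 steps -> roughly half the activations
```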

I have 4 questions.

  1. What are the implications of increasing or decreasing r?
  2. Why does the small batch size not affect training when r=2?
  3. Would training the r=1 model with a batch size of 16 for longer be worthwhile? (I understand this can't be answered objectively without experimentation; I'm wondering in case someone has done similar experiments.)
  4. Why is the memory requirement higher for r=1?