Query regarding post processing

It really depends on the dataset. For a new dataset, I generally train the model without it, then try it afterwards. You can also just enable it at inference.


@erogol

Could you check this?

If you mean postnet_outputs = torch.stack(mel_specs), no, it is also necessary for the Tacotron2 model.

But torch 0.4.1 doesn’t allow stacking an empty list. You get a runtime error saying “expected a non-empty list of Tensors.”
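For reference, a minimal reproduction of that error (the exact message wording varies across torch versions):

```python
import torch

mel_specs = []  # no postnet outputs collected
# torch.stack needs at least one tensor; on an empty list it raises a
# RuntimeError (worded on 0.4.x roughly as "expected a non-empty list of Tensors").
try:
    postnet_outputs = torch.stack(mel_specs)
except RuntimeError as e:
    print(e)
```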

I need to run it to say more, so I’ll do that when I have more time and get back to you.

I’ll start training WaveRNN again with my new taco2 model. I’d really appreciate it if you could share your intuition about the issue, to avoid wasting compute.
Here’s the new config.
configs.zip (3,8 KB)

I see a few differences in our configs.

Let me just list out what I would do:

  1. Train taco2 long enough that it generates decent mel specs (for both test and train inputs), i.e., make sure your taco2 is trained until it’s a decent TTS with Griffin-Lim (not in terms of voice quality, but in how easily words are discernible, intonation, etc.; for that clarity, we’ll train WaveRNN).

  2. Get erogol’s fork, the latest branch (I haven’t tried fatchord’s, so I can’t comment on it). There are a few parameters missing in your config file, without which it will throw errors (I’ll have to wait till Monday to give you the exact params, but you shouldn’t have a problem adding them in while debugging the errors). Add those in and you should be good.

  3. Also, I trim silences for training taco2, and I also had to trim silences in my WaveRNN data; this was the major issue behind my gibberish output. [Mrgloom on the GitHub issues page (WaveRNN+Tacotron2 experiments) had something to say about this. He should be right with his premise that the mel specs aren’t obliged to be aligned with the audio (because of trimming).] See the sketch right after this list.
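Here is a minimal sketch of what I mean by consistent trimming (assuming librosa and soundfile; the top_db threshold is just an example and should match whatever your taco2 preprocessing used):

```python
import librosa
import soundfile as sf

def trim_and_save(in_path, out_path, top_db=40, sr=22050):
    """Trim leading/trailing silence so the audio WaveRNN trains on
    matches the audio the mel specs were computed from."""
    wav, _ = librosa.load(in_path, sr=sr)
    trimmed, _ = librosa.effects.trim(wav, top_db=top_db)
    sf.write(out_path, trimmed, sr)
```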

This is all I have for you right now. I’ll keep adding as and when I remember.

Good luck!

Also, training from scratch, there is some semblance of human speech within 7k steps, albeit with a lot of noise; that should let you know much earlier on whether you’re in trouble (YMMV on this).

Just FYI, I’m getting high-quality results after 130k steps. Top notch!

Hello @alchemi5t, thanks for sharing

Yes, I’m using erogol’s fork with the recommended commits.

I’m using DeepSpeech to preprocess and trim silence via the metadata feature, so I disable the trimming in TTS.

Now I’m hitting this issue with my new VM, and I don’t remember which PyTorch version bypassed the issue on my previous VM :confused:

Just put that line of code inside the scope of if C.model == "Tacotron":. I have trained two WaveRNN models which work just fine with that modification.
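Roughly, the change looks like this (a sketch from memory, not an exact diff of the repo; the names follow the snippet quoted earlier in the thread):

```python
# Only Tacotron needs the stacked postnet outputs here; skipping the stack
# for Tacotron2 avoids torch.stack() crashing on an empty list under 0.4.1.
if C.model == "Tacotron":
    postnet_outputs = torch.stack(mel_specs)
```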

I am not sure what you mean by that. What I understand is: you use ASR to preprocess and trim the files, and the trimmed files are referenced as audio both when training TTS+GL and when training WaveRNN. If so, that should work fine. It’s when your audio and generated mel specs are not aligned that you start to generate gibberish.


Hello @alchemi5t, I’ve tried your suggestion with other param combinations but no luck, same output. While I was training erogol’s WaveRNN, I was also training LPCNet with impressive results using the extracted features. The results convinced me to try combining taco2+LPCNet; here’s my result from the first try (with the same dataset):
lpcnet taco2.zip (3,0 MB)

I still need to figure out where taco2 decreased the volume; a filter somewhere ruined the output:

With tacotron2 features (weird from the middle to the top):

With extracted features:

About speed:
LPCNet is 3x faster than real time with AVX enabled on my very old AMD FX-8350 (only using one core); I haven’t tried with AVX2 enabled yet. The issue here is with tacotron2: it is not real time. As erogol mentioned, the 29M params of Tacotron2 vs. the 7M params that TTS has is not a good deal. Mixing TTS and LPCNet may be a good experiment, but to build a client using TTS and LPCNet we need to find a way to mix PyTorch and TensorFlow (which looks hard to achieve to me) or convert the TTS model to a TensorFlow model (I know a couple of tools we can try).
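One candidate I have in mind (just an assumption, not something I’ve tried end to end) is going through ONNX as an intermediate format. A minimal export sketch with a placeholder module, since the real Tacotron2 autoregressive decoder would likely need changes to export cleanly:

```python
import torch
import torch.nn as nn

# Placeholder standing in for the TTS/Tacotron2 network; the real model's
# dynamic decoder loop may not export without modification.
class DummyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(128, 256)
        self.linear = nn.Linear(256, 80)  # 80 mel channels

    def forward(self, chars):
        return self.linear(self.embedding(chars))

model = DummyEncoder().eval()
dummy_input = torch.randint(0, 128, (1, 50))  # a single 50-character sequence
torch.onnx.export(model, dummy_input, "tts_encoder.onnx")
```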

I know @reuben is working on a client, maybe he can share some insights? FYI, I managed to compile LPCNet for Windows using Bazel.

I’ll share the trained models tomorrow if someone wants to fine tune.

That’s interesting. Happy to see new leads here.

I am sorry I couldn’t be of much help here. I’ll keep working on it to see if I find any hints.

This is very promising, I’ll start experimenting on this soon.

Also, have you successfully trained on multiple GPUs? I believe I have a PyTorch issue where training hangs on dist.init_process_group (near line 84 in distribute.py).

I checked p2p communication and I got this:

Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

So I shouldn’t have problems training with 0,1 or 2,3, but the training hangs indefinitely. I can’t train with r=1 because of memory overflow. Any insights?

I’ve seen a lot of similar issues, most still open, probably because it hangs silently (without errors or logs).
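For anyone who wants to reproduce that peer-access matrix from Python rather than the CUDA samples, something like this should work on recent PyTorch versions (the helper may not exist on 0.4.x):

```python
import torch

# Print which GPU pairs support P2P access, mirroring the table above.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print("Device=%d %s Access Peer Device=%d" % (i, "CAN" if ok else "CANNOT", j))
```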

Yes, with 4 K80s and torch 0.4.1 for TTS; for WaveRNN, only 1 GPU.

Try reducing the max length of the sentences.

Or maybe it’s related to different GPUs? https://devtalk.nvidia.com/default/topic/1023825/peer2peer-with-2-nvidia-cards-geforce-gtx-1080-ti-and-titan-x/

See the last comment.

I wanted to train only on the Titan Vs. I fixed it by installing Apex. Did you need Apex?

No, multi-GPU with TTS just worked, no extra tricks.

Here’s the LPCNet trained model:


Trained using: https://github.com/MlWoo/LPCNet

For the tacotron2 model, I’m still testing where the filter is; once I get it working properly, I’ll share it.

Thank you for sharing this.

I don’t understand what you mean by this. What are you trying to solve, and how? (Just curious.)

@erogol @carlfm01

I’ve not managed to train a good model with
"r": 1, // Number of frames to predict for step.

For various reasons, memory being the top issue. To deal with that, I reduced the batch size to 16 (documented as: “Batch size for training. Lower values than 32 might cause hard to learn attention”), which works without any OOM errors, but it takes way too long to train, and after 30k steps (which I know is low) the generated test audio is just empty. Whereas when I switch to r=2, the generated audio has certain speaker qualities by 1.6k steps, and by 20k steps you can easily discern what the TTS is trying to produce.

I have 4 questions.

  1. What are the implications of increasing or decreasing r?
  2. Why does the small batch size not affect training when r=2?
  3. Would training the r=1 model with a batch size of 16 for longer be worthwhile? (I understand this can’t objectively be answered without experimentation; I’m wondering in case someone has done similar experiments.)
  4. Why is the memory requirement higher for r=1?

I’m back to this topic :smiley:

Please read https://arxiv.org/pdf/1703.10135.pdf (3.3 Decoder); the decoder section will help you understand what’s going on.
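As a rough illustration of why r has such a direct effect on memory and speed (my own back-of-the-envelope reasoning, not something specific to this codebase): the decoder emits r mel frames per step, so the number of decoder steps, and with it the attention matrix and the activations kept around for backprop, grows roughly as 1/r.

```python
import math

def decoder_steps(n_mel_frames, r):
    """Decoder iterations needed to cover a mel spectrogram when r
    non-overlapping frames are predicted per step."""
    return math.ceil(n_mel_frames / r)

# e.g. a ~7 s clip at 22050 Hz with hop length 256 is about 600 mel frames
for r in (1, 2, 5):
    print(r, decoder_steps(600, r))  # -> 600, 300 and 120 steps
```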

From what I noticed trying to train with a lower batch_size: if you see a good alignment and then it breaks, it almost surely won’t align back.

Same here. From my experience testing TTS and different Tacotron versions, I think it’s better to throw away data than to lower the batch size. With TTS it’s really easy to find a good balance using the max length.

For tacotron2 (not TTS), what I did was sort the text using a text editor and remove the longest sentences manually; most of the time just a few very long sentences ruin the whole thing.
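If you’d rather not do it by hand, a small script can apply the same filter (a sketch assuming an LJSpeech-style metadata.csv with pipe-separated fields; the 200-character cutoff is just an example):

```python
MAX_CHARS = 200  # example cutoff; tune it to your own length distribution

# Keep only the rows whose transcript is short enough.
with open("metadata.csv", encoding="utf-8") as fin, \
        open("metadata_filtered.csv", "w", encoding="utf-8") as fout:
    for line in fin:
        text = line.rstrip("\n").split("|")[-1]  # id|raw_text|normalized_text
        if len(text) <= MAX_CHARS:
            fout.write(line)
```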


@erogol Hello, is it OK to share my tests here in the forum even if they are not 100% related to Mozilla TTS but to TTS in general?

FYI, I think I’ve solved the issue: Tacotron2 was using a “target mel scale”; I removed that scale clipping and now it looks promising.

With just 5k steps the attention looks good, and so does the audio. My previous attempts required at least 60k steps before the alignment started to appear.

10k step audios:
10k.zip (317,3 KB)

Good to see you back here!

I did read that; I was wondering if someone could shed light on the values and their direct implications for memory, speed, and alignment time in this implementation. (If anyone has logged that.)

I’ve not removed the sentences, but I have decreased the max seq len to 200; still not able to run r=1 with a batch size of 32, though.

Hope that’s a yes. I’d love to see what you’re working on and how it’s working out for you.

I think the only way right now is to lower the max seq len; there’s an issue about OOM: https://github.com/mozilla/TTS/issues/183

How’s your length distribution? If you go lower than 200, will you lose a lot of data?
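(A quick way to check, assuming an LJSpeech-style pipe-separated metadata.csv:)

```python
from collections import Counter

# Histogram of transcript lengths in 50-character bins, to see how much
# data a 200-character cap would actually discard.
lengths = []
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        lengths.append(len(line.rstrip("\n").split("|")[-1]))

bins = Counter((l // 50) * 50 for l in lengths)
for start in sorted(bins):
    print("%4d-%4d: %d" % (start, start + 49, bins[start]))
print("over 200:", sum(l > 200 for l in lengths), "of", len(lengths))
```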