From what I noticed trying to train with a lower batch_size, if you see a good alignment and then it breaks, it almost surely won't align back.
Same here. From my experience testing TTS and different Tacotron versions, I think it's better to throw away data rather than lower the batch size. With TTS it's really easy to find a good balance using the max length.
For Tacotron2 (not TTS), what I did was sort the text using a text editor and remove the longer sentences manually; most of the time just a few very long sentences ruin the whole thing.
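In case it helps, here's a minimal sketch of that kind of filtering, assuming an LJSpeech-style metadata.csv with pipe-separated fields (the filename, column index and character threshold are just placeholders, adjust them to your own transcript format):

```python
# Rough sketch: drop the longest transcripts from an LJSpeech-style metadata.csv.
# Assumes pipe-separated lines like "id|raw text|normalized text"; adjust as needed.
MAX_CHARS = 200  # character threshold, pick whatever your GPU tolerates

with open("metadata.csv", encoding="utf-8") as f:
    lines = f.readlines()

# Keep only lines whose (last) text field is short enough.
kept = [line for line in lines if len(line.strip().split("|")[-1]) <= MAX_CHARS]
print(f"kept {len(kept)} of {len(lines)} sentences")

with open("metadata_filtered.csv", "w", encoding="utf-8") as f:
    f.writelines(kept)
```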
I did read that; I was wondering if someone could shine a light on the values and their direct implications on memory, speed and alignment time for this implementation. (If anyone has logged that.)
I haven't removed the sentences, but I have decreased the max seq len to 200. Still not able to run r=1 at batch_size 32, though.
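For reference, this is roughly how I tweak those knobs before a run; the key names ("max_seq_len", "batch_size", "r") come from the config.json version I have locally, so double-check them against yours:

```python
# Sketch: adjust the relevant training knobs in TTS's config.json.
# Key names may differ between repo versions; verify against your config.
import json

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

config["max_seq_len"] = 200  # cap input length
config["batch_size"] = 32
config["r"] = 1              # decoder reduction factor

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4)
```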
Hope that’s a yes. I’d love to see what you’re working on and how it’s working out for you.
I sort of recall using 100/200 with a K80 11GB on a single GPU; then when I tried dual GPU it required lowering the max length a bit. Do you get the same results using a single GPU?
I’ve removed everything that touches the prediction and now it’s working fine. As for T2_output_range, I think it’s OK to call it the “output/target scale”, or am I wrong?
I see gaps in the alignment; I saw the same gaps while I was training TTS, so I guess they are data related. About the audio, I don’t hear a significant improvement from 10k to 25k steps, and I don’t hear expressive speech on questions and special characters either. I think it’s related to the speaker, since the source voice is so flat. I’m cutting a more expressive female speech to adapt using the trained model; hopefully the issue is not LPCNet being unable to be expressive over Tacotron predictions.
I haven’t tried yet; I am going to let this model train on 2000 sentences and see what r=1 actually gives me in terms of quality of the generated audio. (According to 3.3 in the paper, they’ve only discussed the major pros of having r>1 and not what the tradeoffs are, if any.)
I’ve trained it for around 30k steps and the quality is much better than what I had at r=2, but not better than WaveRNN. I have to figure out some way to make it hog less VRAM so that I can actually train the Taco2 on my entire dataset (followed by another WaveRNN training session).
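On the VRAM point, one generic workaround (not something this repo does out of the box, as far as I know) is gradient accumulation: keep the per-step batch small and only call optimizer.step() every N batches, which mimics a larger effective batch without the memory cost. Rough PyTorch sketch, where model, loader, criterion and optimizer stand in for whatever your training script already builds:

```python
# Rough PyTorch sketch of gradient accumulation; model, loader, criterion and
# optimizer are placeholders for the objects your training script already has.
accum_steps = 4  # effective batch = batch_size * accum_steps

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    # Scale the loss so accumulated gradients average over the effective batch.
    loss = criterion(outputs, targets) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```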