Multispeaker development progress

Multi-GPU training does not improve the convergence speed for TTS if you don’t tune the hyperparameters, like the learning rate.
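One common heuristic for that tuning (my assumption, not a recipe from this thread) is the linear scaling rule: grow the learning rate in proportion to the effective batch size across GPUs.

```python
# Linear scaling rule sketch; the GPU count is hypothetical.
base_lr = 1e-4   # single-GPU learning rate, as used later in this thread
n_gpus = 4       # hypothetical number of GPUs
scaled_lr = base_lr * n_gpus  # effective batch grows n_gpus-fold
```

Whether this helps depends on the model and dataset, so treat it as a starting point rather than a rule.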

It depends on your dataset, but you should see something after a day.

I was training on the 100 variant of LibriTTS, with a learning rate of 0.0001, forward attention, a batch size of 64 and the original prenet, for about a day. However, it seemed to overfit, so I switched to the bidirectional decoder; with a batch size of 64 it kept crapping out, hence why I asked. I am now trying with a batch size of 32 and the BN prenet. Would you have any recommendations?

Not really. Just see what works for you.

Thanks. Will keep trying things and see what works. Very exciting to play with the power of embeddings.

Hi again. I have been trying to train this whole week, but unfortunately everything has been overfitting (plateauing early on, with no improvement after a day). I have been training Tacotron, first on LibriTTS 100 and then on LibriTTS 360, female speakers only. I tried forward attention at first (all three switches in the config enabled), and bidirectional at some point, but unfortunately it craps out because of memory. I am trying Graves now, with forward attention enabled as well. Should I give up on Tacotron altogether and try Taco2? I cannot figure out why it is overfitting; I thought it would surely do okay.

Overfitting is not a huge deal with TTS. Just check the audio quality by listening to the samples.

When you use Graves attention, forward attention is disabled by default.

Ah okay, thanks, I had a suspicion. It’s just that even after a day, all I get in the testing samples is static noise, and even teacher forcing in eval does not look great (plus the alignment score plateaus after 2k steps and does not improve at all).

If it is not the alignment, then there might be something else broken as well.

Any news, @georroussos? I am really interested to see your results. :)

Right, so,

As I said, I ended up training Tacotron (on the master branch) on all the female speakers of LibriTTS-360. At first I tried to use forward attention, but the only mechanism that worked was Graves. I tried bidirectional, but sadly it didn’t work (it kept throwing memory errors). I also needed to disable self.attention.init_win_idx() in layers/tacotron.py, because otherwise it refused to synthesize. I am at approximately 103k steps now. What I have gathered:

  • I have to restart training every 2 days, because the attention drops
  • In general, graves is a very good mechanism
  • I don’t know if it is due to the multispeaker nature, but it seems that, at random intervals, the alignment drops extremely low and then gradually goes back up.
  • At 103k steps, the model is somewhat able to synthesize in different voices; however, punctuation like commas breaks it (it does not read past the comma). I am uploading syntheses with the speaker 1 flag and the speaker 14 flag.

I don’t have any test spectrograms, because I just started retraining. Again, my goal is more to produce a novel speaker using embeddings. I will let it train more and see how it goes, then feed in external embeddings I have extracted using the speaker encoder, instead of the speaker embedding layer. If the model I am training right now turns out to be good, I would like to contribute it to the TTS project page on git, along with the changes needed for loading your own embeddings.

Any thoughts? I wonder if I can use this model to start a training session with forward attention, or batch normalization, after a certain number of steps. I also left r at 7, because gradual training is enabled, so I didn’t think changing it would do anything.
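For anyone wondering why a hand-set r is ignored: gradual training picks r from a step-based schedule, so the value in the config is overridden as training progresses. A minimal sketch of that lookup (the schedule values below are placeholders, not this thread’s config):

```python
# (start_step, r, batch_size) triples; the numbers are made up for illustration
schedule = [(0, 7, 64), (10000, 5, 64), (50000, 3, 32), (130000, 2, 32)]

def current_r(step, schedule):
    """Return the reduction factor r of the latest schedule entry
    whose starting step has already been reached."""
    r = schedule[0][1]
    for start, new_r, _batch in schedule:
        if step >= start:
            r = new_r
    return r
```

So with gradual training enabled, whatever r you set by hand only matters until the first schedule entry kicks in.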

As always, thanks for all the work! Really happy if I can contribute in any way.

samples.zip (104.2 KB)

Update: Test figures have come in!

For a multi-speaker model, it takes longer for the model to show reasonable performance. I could only see it work well enough after 700K iterations, so maybe it is better to be patient. You might also train another model by fine-tuning a pretrained LJSpeech model. That might make the problem easier.

I tried to do it on a pretrained LJSpeech model, but it said that “as of now, you cannot introduce new speakers to an already trained model”.

Yes, you cannot do that, but you can initialize a new model partially with the matching layers of the LJSpeech model. So if you pass the model with the --restore_path flag, it will load into your new model all the layers whose shapes match.
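That partial initialization can be sketched roughly like this (a minimal stand-in for the restore logic, assuming plain PyTorch state dicts; the function name and toy models are mine):

```python
import torch

def partial_restore(model, checkpoint_state):
    """Copy into `model` only the checkpoint tensors whose name and
    shape match; everything else keeps its fresh initialization."""
    model_state = model.state_dict()
    matched = {name: tensor for name, tensor in checkpoint_state.items()
               if name in model_state and tensor.shape == model_state[name].shape}
    model_state.update(matched)
    model.load_state_dict(model_state)
    return matched  # handy for logging which layers were restored

# Toy example: the second layer changes shape, so only the first is restored.
old = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
new = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 5))
restored = partial_restore(new, old.state_dict())
```

The mismatched layers (here, the output layer) simply keep their random initialization and are trained from scratch.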

@georroussos how did you integrate speaker vectors to the model? Do you have your code somewhere? I can check if everything looks alright.

Aha, but that is what I tried. Then I checked the code and saw that, if the multispeaker embeddings option is enabled in the config, it checks whether I also gave it --restore_path, and if I did, it checks the .json file. But I will definitely try again. Which model would you recommend? I think the one trained with forward attention and fine-tuned with BN would be a good candidate. But would I keep training it with BN? And also, keep its config file?

I integrated speaker embeddings by editing models/tacotron2.py. I changed the condition to if num_speakers > 0 (I know it is redundant), and then initialized a torch.FloatTensor variable, which included my embeddings. I created a lookup table with torch.nn.Embedding.from_pretrained(weight) and froze the layer with self.speaker_embedding.weight.requires_grad = False. Something like this:
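A minimal sketch of that lookup table, assuming 256-dimensional vectors for ten speakers (both numbers are placeholders):

```python
import torch

# Placeholder for external embeddings extracted with the speaker encoder,
# one vector per speaker (10 speakers x 256 dims are made-up numbers).
weight = torch.randn(10, 256)

# Lookup table built from the precomputed vectors; from_pretrained freezes
# it by default, and the explicit line mirrors the post above.
speaker_embedding = torch.nn.Embedding.from_pretrained(weight)
speaker_embedding.weight.requires_grad = False

# At synthesis time, --speaker_id selects one row of the table.
vec = speaker_embedding(torch.tensor([0]))
```

With the layer frozen, training cannot drift the embeddings away from what the speaker encoder produced.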

Then I think I fine-tuned the LJSpeech model for a while to include the embeddings (or not, I really do not remember), and called it at inference time with --speaker_id 0. It is a hacky way, but the embeddings did load and did change the prosody, as we saw.

Maybe you can fork it and push your changes to GitHub so we can collaborate.

You can take the latest released model, but train it using just location-sensitive attention and the normal prenet (not BN). If it trains well, you can then switch to BN, but I’d suggest using forward attention only for inference.
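In config terms, that recipe might look something like the fragment below. The key names are from memory of the Mozilla TTS config of that era and may differ in your version, so verify each against your own config.json:

```json
{
  "prenet_type": "original",
  "location_attn": true,
  "use_forward_attn": false,
  "transition_agent": false,
  "forward_attn_mask": false
}
```

The idea being that use_forward_attn stays off for training and is flipped on only when you synthesize.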

I’d be super glad! I will fork now and start working on it.

Which one is the latest model? Is it Taco2 with Graves?

This is what comes up when I use the original config and the --restore_path flag.