After some email discussion with Eren, I am creating this thread for multi-speaker-related progress on Mozilla TTS. Particular objectives include manipulating speaker encoder embeddings for voice altering (creating mixed voices) and conditioning the TTS on them.
As Eren mentioned to me, there is no multi-speaker doc on the git at present. My goal is to perhaps write something up myself, if I come up with anything worth looking into.
If anyone has any questions/results/thoughts they want to share, please do!
As I have been discussing with Eren, I am interested in seeing how we can manipulate the voice quality of a TTS model using external embeddings; however, I have been having trouble using external embeddings extracted with the speaker encoder. Eren suggested initializing the speaker embedding layer with the tensor I have and then freezing the weights, but I have not really been successful. If anyone has ever worked with this or has any guidance, please go ahead!
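For reference, this is roughly what I tried based on that suggestion; a minimal PyTorch sketch (the file name and shapes are placeholders of my own, not the actual TTS code):

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical file produced by the speaker encoder: shape (num_speakers, embedding_dim).
pretrained = torch.from_numpy(np.load("speaker_embeddings.npy")).float()

# Initialize the speaker embedding layer from the tensor and freeze it,
# so only the rest of the model is updated during finetuning.
speaker_embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

# The layer is then looked up with speaker ids as usual, e.g.:
# emb = speaker_embedding(speaker_ids)  # (batch,) -> (batch, embedding_dim)
```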
A little update. I was able, after all, to modify the pipeline a bit and feed it my own custom embeddings. Some info: I extracted them using the speaker encoder, and they are the mean of embeddings from different LibriTTS speakers. I modified Tacotron2 to feed these embeddings at training and inference time, then loaded a model pretrained on LJSpeech to see if the voice would change in any way when the embeddings were used. The voice did not change, but the prosody does change when the embeddings are used (attaching 2 samples). I suspect this is due to the model being single-speaker, so it simply cannot generalize, but if anyone has any input, let's hear it.
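For anyone curious, the mixing itself was nothing fancy; roughly along these lines (the paths are placeholders, and the re-normalization at the end is my own assumption, not something required by the encoder):

```python
import numpy as np

# Per-speaker embeddings already extracted with the speaker encoder and
# saved as .npy files (paths are placeholders).
paths = ["embeddings/libritts_spk1.npy",
         "embeddings/libritts_spk2.npy",
         "embeddings/libritts_spk3.npy"]
embeddings = np.stack([np.load(p) for p in paths])  # (num_speakers, embedding_dim)

# The "mixed" voice is simply the mean of the selected speakers' embeddings.
mixed = embeddings.mean(axis=0)

# Optionally re-normalize to unit length, since the encoder outputs are
# roughly unit-norm vectors (this step is an assumption on my part).
mixed = mixed / np.linalg.norm(mixed)

np.save("embeddings/mixed_speaker.npy", mixed)
```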
Now I will try to train a multi-speaker Tacotron model and see if I can feed it my own embeddings and maybe generate novel speakers. It will be very interesting to see the results.
@erogol @nmstoker any thoughts? Maybe I did something wrong, but I did hope that finetuning the pretrained model on the embeddings for a little while might change the voice; alas, it did not…
You need to train the model in a multi-speaker setting to see whether it works with the embeddings. A single-speaker model would not generalize to a different speaker.
I thought the same. Well then, that is what I shall do. It is definitely promising that, even in the single-speaker case, custom embeddings change the prosody; I think that is a good sign for GST modelling. Will come back with multi-speaker results.
I was training on the clean-100 subset of LibriTTS with a learning rate of 0.0001, forward attention, a batch size of 64, and the original prenet, for about a day. However, it seemed to overfit, so I am now trying the bidirectional decoder; with a batch size of 64 it kept running out of memory, hence my question. I am now trying with a batch size of 32 and the BN prenet. Would you have any recommendations?
Hi again. I have been trying to train all this week, but unfortunately everything has been overfitting (plateauing early on, with no improvement after a day). I have been training Tacotron, first on LibriTTS clean-100 and then on LibriTTS clean-360, female speakers only. I tried forward attention at first (all three switches enabled in the config) and the bidirectional decoder at some point, but that runs out of memory. I am trying Graves now, with forward attention enabled as well. Should I give up on Tacotron altogether and try Taco2? I cannot figure out why it is overfitting; I thought it would surely do okay.
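For reference, these are the switches I keep toggling between runs; a sketch using the repo's load_config helper (key names are from the config.json on the branch I am running and may differ on other versions):

```python
# Sketch only: load the training config and flip the switches discussed above.
from TTS.utils.generic_utils import load_config

config = load_config("config.json")
config["batch_size"] = 32            # 64 kept running out of memory
config["lr"] = 0.0001
config["prenet_type"] = "bn"         # "original" in the first run
config["attention_type"] = "graves"  # current attempt
config["use_forward_attn"] = True    # the three forward-attention switches
config["forward_attn_mask"] = True
config["transition_agent"] = True
config["bidirectional_decoder"] = False  # crashes with batch size 64 for me
```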
Ah okay, thanks, I had a suspicion. It's just that even after a day all I get in the test samples is static noise, and even teacher forcing in eval does not look great (plus the alignment score plateaus after 2k steps and does not improve at all).
As I said, I ended up training Tacotron (on the master branch) on all the female speakers of LibriTTS clean-360. At first I tried to use forward attention, but the only mechanism that worked was Graves. I tried the bidirectional decoder, but sadly it didn't work (it kept throwing memory errors). I also needed to disable self.attention.init_win_idx() in layers/tacotron.py, because otherwise it refused to synthesize. I am at approximately 103k steps now. What I have gathered:
I have to restart training every 2 days, because the attention drops
In general, graves is a very good mechanism
I don’t know if it is due to the multi-speaker nature, but it seems that, at random intervals, the alignment drops extremely low and then gradually climbs back up.
At 103k steps, the model is somewhat able to synthesize in different voices; however, punctuation like commas breaks it (it doesn’t read past the comma). I am uploading syntheses with speaker 1 and with speaker 14 passed as the speaker flag.
I don’t have any test spectrograms, because I just started retraining. Again, my goal is more about producing a novel speaker using embeddings. I will let it train more and see how it goes, then use the external embeddings I extracted with the speaker encoder and feed those instead of the speaker embedding layer. If the model I am training turns out to be good, I would like to contribute it to the TTS project page on git, along with the changes needed for loading your own embeddings.
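To give an idea of the kind of change I mean: instead of looking up the nn.Embedding layer, the saved vector is tiled over time and combined with the encoder outputs. A rough sketch of my local modification (concatenation here is just one way to condition, not necessarily the repo's exact code; the path is a placeholder):

```python
import numpy as np
import torch


def condition_on_external_embedding(encoder_outputs, speaker_emb):
    """Tile one external speaker embedding over the time axis and concatenate
    it to the encoder outputs, standing in for the usual embedding lookup."""
    batch, time_steps, _ = encoder_outputs.shape
    emb = speaker_emb.to(encoder_outputs.device)
    emb = emb.unsqueeze(0).unsqueeze(1).expand(batch, time_steps, -1)
    return torch.cat([encoder_outputs, emb], dim=-1)


# Usage sketch: load the saved speaker encoder output and use it in place of
# the speaker id lookup inside the model's inference method.
external_emb = torch.from_numpy(np.load("embeddings/mixed_speaker.npy")).float()
# encoder_outputs = ...  # (batch, time, encoder_dim) from the model's encoder
# conditioned = condition_on_external_embedding(encoder_outputs, external_emb)
```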
Any thoughts? I wonder if I can use this model to start a training session using forward attention, or batch normalization, after a certain number of steps. I also left r at 7, because gradual training is enabled, so I didn’t think changing it would do anything.
As always, thanks for all the work! Really happy if I can contribute in any way.
For a multi-speaker model it takes longer to show reasonable performance. I could only see it working well enough after 700K iterations, so maybe it is better to be patient. You might also train another model by finetuning a pretrained LJSpeech model. That might make the problem easier.