As I said, I ended up training Tacotron (on the master branch) on all the female speakers of Libri-TTS-360. At first I tried to use Forward Attention, but the only mechanism that worked was Graves. I also tried the bidirectional decoder, but sadly it didn’t work (it kept throwing memory errors). In addition, I needed to disable
layers/tacotron.py, because otherwise it refused to synthesize. I am at approximately 103k steps now. Here is what I have gathered so far:
- I have to restart training every 2 days, because the attention drops
- In general, Graves is a very good mechanism (a rough sketch of why I think it recovers so well is after this list)
- I don’t know if it is due to the multi-speaker nature of the data, but it seems that, at random intervals, the alignment drops very low and then gradually climbs back up.
- At 103k steps, the model is somewhat able to synthesize in different voices; however, punctuation such as commas breaks it (it does not read past the comma). I am uploading syntheses with speaker 1 and speaker 14 passed as the speaker flag.
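For anyone curious why Graves keeps recovering, here is my rough understanding in code: the alignment is a mixture of Gaussians over encoder positions, and the means can only move forward, so even when the weights flatten out the read position keeps advancing. This is a minimal sketch of the classic Graves formulation; names and shapes are illustrative, not the actual TTS implementation.

```python
import torch
import torch.nn.functional as F

def graves_attention(mu_prev, gmm_params, seq_len):
    """Sketch of Graves (GMM) attention over encoder positions.

    mu_prev:    (batch, K) component means from the previous step
    gmm_params: (batch, 3*K) raw outputs -> weight, delta, scale per component
    """
    w, delta, sigma = gmm_params.chunk(3, dim=-1)     # (batch, K) each
    mu = mu_prev + F.softplus(delta)                  # means only move forward
    sigma = F.softplus(sigma) + 1e-5                  # keep scales positive
    w = torch.softmax(w, dim=-1)                      # mixture weights

    j = torch.arange(seq_len, dtype=torch.float32)    # encoder positions
    # broadcast to (batch, K, seq_len): one Gaussian bump per component
    phi = w.unsqueeze(-1) * torch.exp(
        -0.5 * ((j - mu.unsqueeze(-1)) / sigma.unsqueeze(-1)) ** 2
    )
    alpha = phi.sum(dim=1)                            # (batch, seq_len) weights
    return alpha, mu
```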
I don’t have any test spectrograms, because I just started retraining. Again, my goal is more about producing a novel speaker using embeddings. I will let it train longer and see how it goes, then use external embeddings I have extracted with the speaker encoder and feed those in instead of the speaker embedding layer (rough sketch below). If the model turns out to be good, I would like to contribute it to the TTS project page on GitHub, along with the changes needed for loading your own embeddings.
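In case it is useful, this is roughly how I picture the embedding swap; a minimal sketch assuming the external embeddings are saved as a dict of speaker id to vector. The file name, the 256-dim size, and the function name are placeholders, not the actual TTS code.

```python
import torch

# Embeddings precomputed with the speaker encoder (placeholder file name).
external = torch.load("speaker_embeddings.pt")  # {speaker_id: Tensor(256,)}

def lookup_speaker(speaker_ids, learned_embedding=None):
    """Stack external vectors for a batch of ids; fall back to the
    learned nn.Embedding layer when no external table is available."""
    if external is not None:
        return torch.stack([external[int(i)] for i in speaker_ids])
    return learned_embedding(speaker_ids)
```

So wherever the model currently does the `nn.Embedding` lookup on the speaker ids, it would call something like this instead, and the rest of the forward pass stays the same.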
Any thoughts? I wonder if, after a certain number of steps, I can use this model as the starting point for a training session with forward attention or the batch-normalized prenet (config sketch below). I also left r at 7; since gradual training is enabled, I didn’t think changing it manually would do anything.
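If that kind of restart makes sense, I assume it would just be restoring the checkpoint with a modified config, e.g. `python train.py --config_path config.json --restore_path <checkpoint>.pth.tar`, with something like the keys below. These are the names as I understand them from config.json, so please correct me if they differ.

```json
{
  "attention_type": "original",  // switch back from "graves"
  "use_forward_attn": true,      // forward attention sits on top of the original attention
  "prenet_type": "bn"            // batch-normalized prenet instead of "original"
}
```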
As always, thanks for all the work! Really happy if I can contribute in any way.