I would like to start experimenting with GST, but the problem I have is that my dataset performs quite well on Taco2 (it reaches an alignment score of 0.65), while it just doesn’t learn anything when Taco is the model of choice (which is odd, since Taco is supposed to be the smaller model). Is there any way to make GST work on Taco2?
You just need to code it. It should be easy.
I have tried to edit tacotron2.py in models and include the layer, but I am not sure I did it correctly. Is there any way I can get some pointers? I have started training a new model and put some prints here and there, but they are not showing up, so I don’t know whether the layer is actually being used. I changed the code in tacotron2.py and enabled GST learning in the config file, but didn’t edit train.py.
tacotron2.py.zip (2.1 KB)
@edresson1 is working on the same problem. You might help each other.
Thanks! I think I got it to work, it does output the tensor from the gst layer, but I have not started training yet, because I am extending my dataset. Will report results.
Basically, change models/tacotron2.py, layers/tacotron2.py and utils/synthesis.py.
In this commit I also fixed some bugs in models/tacotron.py and generic_utils.py; just ignore those changes.
I am concatenating the GST embedding instead of adding it. My implementation is inspired by https://github.com/NVIDIA/mellotron
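The difference between the two combination strategies can be sketched with plain arrays. All shapes and sizes below are illustrative assumptions, not the actual dimensions used in the repo or in Mellotron:

```python
import numpy as np

# Toy shapes (assumptions, for illustration only).
B, T, D = 2, 50, 512   # batch, encoder time steps, encoder channels
D_gst = 128            # style (GST) embedding size

encoder_outputs = np.random.randn(B, T, D)
gst_embedding = np.random.randn(B, D_gst)   # one style vector per utterance

# Broadcast the per-utterance style vector across every encoder time step.
gst_tiled = np.repeat(gst_embedding[:, None, :], T, axis=1)  # [B, T, D_gst]

# Option A: concatenation (what the post describes) -- the decoder then
# attends over D + D_gst channels.
combined = np.concatenate([encoder_outputs, gst_tiled], axis=-1)
assert combined.shape == (B, T, D + D_gst)

# Option B: addition -- keeps the channel count fixed, but only works
# when D_gst == D (or after projecting the style vector to D channels):
# combined = encoder_outputs + gst_tiled
```

Concatenation avoids forcing the style embedding into the encoder’s channel size, at the cost of a wider decoder input.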
I’m interested in this too and took a closer look at what you’re doing in your voice-cloning branch. It seems you’re putting a lot of work into it, which is pretty cool! Let’s see if I get the process right, maybe that’s also interesting for others.
Let’s assume I’d only be interested in the Voice-Cloning part, ignoring GST.
My understanding is that I’d first train the Speaker Encoder on a multispeaker dataset, e.g. LibriTTS (alternatively I could use the pretrained model). Then I’d use this notebook to extract the speaker embeddings. Each embedding corresponds to one utterance spoken by a speaker, correct? The notebook produces a mapping file that I use for training Tacotron2 by setting the speaker_embedding_file parameter in the config.json. I noticed that in your notebook, the paths indicate you used LibriTTS for training the Speaker Encoder, but your Portuguese dataset to extract the embeddings. My intuition would be that both should be the same, though? Also, I’m wondering whether you tried averaged embeddings per speaker and whether they are advantageous during training, as proposed in this issue.
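If per-speaker averaging turns out to help, it could be done as a post-processing step on the mapping file before training. Here is a minimal sketch, assuming a hypothetical per-utterance mapping structure (clip id → speaker name plus embedding); the real file layout produced by the notebook may differ:

```python
import numpy as np

# Hypothetical mapping-file contents: one entry per utterance.
mapping = {
    "clip_0001.wav": {"name": "spk_a", "embedding": [0.1, 0.3]},
    "clip_0002.wav": {"name": "spk_a", "embedding": [0.3, 0.1]},
    "clip_0003.wav": {"name": "spk_b", "embedding": [0.5, 0.5]},
}

def average_per_speaker(mapping):
    """Replace each utterance embedding with its speaker's mean embedding."""
    by_speaker = {}
    for entry in mapping.values():
        by_speaker.setdefault(entry["name"], []).append(entry["embedding"])
    means = {spk: np.mean(embs, axis=0).tolist() for spk, embs in by_speaker.items()}
    return {
        clip: {"name": entry["name"], "embedding": means[entry["name"]]}
        for clip, entry in mapping.items()
    }

averaged = average_per_speaker(mapping)
# Both spk_a utterances now share the same mean vector.
```

Training would then point speaker_embedding_file at the averaged file instead of the raw one.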
Inference and Cloning
At inference, I’d pass the speaker_fileid argument to synthesize speech in the voice of one of the embeddings included in the file. Technically, it should also be possible to pass a file with new embeddings that weren’t introduced to the network at training, to synthesize new voices, right?
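A minimal sketch of how an unseen voice might be injected at inference time, assuming the same mapping-file format as above. The file name, keys, and embedding values here are all placeholders, not the project’s actual conventions:

```python
import json

# Suppose the speaker encoder produced an embedding for a brand-new
# speaker (values are placeholders; real embeddings are much larger).
new_embedding = [0.12, -0.07, 0.33]

# Start from an empty mapping, or load an existing speakers file instead.
mapping = {}
mapping["new_voice_0001.wav"] = {
    "name": "unseen_speaker",
    "embedding": new_embedding,
}

with open("speakers_with_new_voice.json", "w") as f:
    json.dump(mapping, f)

# At synthesis time, point speaker_embedding_file at this json and select
# the "new_voice_0001.wav" entry via the speaker_fileid argument.
```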
Yes, that is the idea :)
This is all still experimental, I’m testing different configurations until I find the best one.
According to this paper, per-sample embeddings work best for cloning: http://papers.nips.cc/paper/7700-transfer-learning-from-speaker-verification-to-multispeaker-text-to-speech-synthesis.pdf
About training the speaker encoder on the same dataset as the synthesis model: that would only help improve quality during training. When cloning, the model receives unseen speakers anyway, and that is what affects the quality of the cloning.