Multispeaker development progress

Yes, you cannot do that, but you can partially initialize a new model with the matching layers of the LJSpeech model. If you pass the model with the --restore_path flag, it will load into your new model every layer whose shape matches.
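A minimal sketch of that kind of shape-matched partial load in plain PyTorch. This is not the actual TTS loading code; the helper name `load_matching_layers` and the two toy models are made up for illustration.

```python
import torch
from torch import nn


def load_matching_layers(model, checkpoint_state):
    """Copy into `model` only the checkpoint tensors whose name and shape match."""
    model_state = model.state_dict()
    matching = {
        name: tensor
        for name, tensor in checkpoint_state.items()
        if name in model_state and tensor.shape == model_state[name].shape
    }
    # Overwrite the matching entries, keep the fresh initialization elsewhere.
    model_state.update(matching)
    model.load_state_dict(model_state)
    skipped = sorted(set(model_state) - set(matching))
    return sorted(matching), skipped


# Toy stand-ins: the second layer differs in shape, so it cannot be restored.
old_model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
new_model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 3))

loaded, skipped = load_matching_layers(new_model, old_model.state_dict())
print("loaded:", loaded)
print("left at fresh initialization:", skipped)
```

Layers that are skipped simply keep their random initialization, so they still need training.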

@georroussos how did you integrate speaker vectors into the model? Do you have your code somewhere? I can check if everything looks alright.

Aha, but that is what I tried. Then I checked the code and I saw that, if the multispeaker embeddings option is enabled in the config, it checks if I also gave it --restore_path and if I did, it checks the .json file. But I will definitely try again. Which model would you recommend? I think the one trained on ForwardAttn and fine-tuned on BN would be a good candidate. But would I keep on training it with BN? And also, keep the config file from it?

I integrated speaker embeddings by editing Tacotron2.py in models/tacotron2.py. I changed the condition to if num_speakers > 0 (I know it is redundant), and then initialized a torch.FloatTensor variable that held my embeddings. I created a lookup table with torch.nn.Embedding.from_pretrained(weight) and froze the layer with self.speaker_embedding.weight.requires_grad = False. Something like this:
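(The original snippet is missing from the thread. What follows is a minimal sketch of the approach described, not the poster's actual code. The class name `SpeakerConditioning`, the random placeholder vectors, and the choice to concatenate the speaker vector onto every encoder frame are assumptions.)

```python
import torch
from torch import nn


class SpeakerConditioning(nn.Module):
    """Frozen lookup table of pretrained speaker vectors (illustrative only)."""

    def __init__(self, pretrained_vectors):
        super().__init__()
        # freeze=True sets weight.requires_grad = False, so the table is not trained.
        self.speaker_embedding = nn.Embedding.from_pretrained(
            pretrained_vectors, freeze=True
        )

    def forward(self, encoder_outputs, speaker_ids):
        # encoder_outputs: (batch, time, enc_dim), speaker_ids: (batch,)
        spk = self.speaker_embedding(speaker_ids)  # (batch, spk_dim)
        # Repeat the speaker vector along the time axis.
        spk = spk.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        # Assumption: condition by concatenating along the feature axis.
        return torch.cat([encoder_outputs, spk], dim=-1)


# Placeholder for externally computed embeddings: 2 speakers, 16 dimensions.
vectors = torch.randn(2, 16)
conditioning = SpeakerConditioning(vectors)

encoder_outputs = torch.randn(3, 5, 8)  # batch=3, time=5, enc_dim=8
speaker_ids = torch.tensor([0, 1, 0])
out = conditioning(encoder_outputs, speaker_ids)
print(out.shape)
```

Any layer that consumes the concatenated output has to expect `enc_dim + spk_dim` input features.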

Then I think I fine-tuned the LJSpeech model for a while to include the embeddings (or not, I really do not remember), and called it at inference time with --speaker_id 0. It is a hacky way, but the embeddings did load and did change the prosody, as we saw.

Maybe you can fork it and push your changes to GitHub so we can collaborate.

You can take the latest released model but train it using just the location-sensitive attention and the normal prenet (not BN). If it trains well, then you can switch to BN, but I’d suggest using forward attention only for inference.

I’d be super glad! I will fork now and start working on it.

Which one is the latest model? Is it Taco2 with Graves?

This is what comes up when I use the original config and the --restore_path flag.

You can probably disable that assert for your run until we find a better check.

Nope, not working. Disabling the assert altogether gets to Epoch 0/1000 and throws this:

The error is unrelated. It is a memory error. You can see the problem better if you run training on CPU.

Hi everyone,

Pardon the tardiness, but it has been a hectic time for me. I thought I would drop some updates on multispeaker.

First of all, I have not been able to get Tacotron (junior) to work at all. I do not know if it is my datasets, but it just refuses to align. Tacotron2, on the other hand, seems to do much better in terms of alignments. I have been trying some things here and there, but I still do not have a large enough dataset to get good results and I do not have the time, or the resources, to train on open source English datasets, as they are of no use to me and multispeaker seems to require a lot of training time.

It seems that a sequence limit of approximately 80 characters is what I can get away with, on an NVIDIA K80 GPU. Anything higher than that and CUDA craps out.
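A minimal sketch of enforcing such a limit by dropping long utterances before training. The `(text, wav_path)` tuple layout, the file names, and the helper name are hypothetical; 80 is simply the figure quoted above.

```python
MAX_CHARS = 80  # the approximate limit reported above for a K80


def filter_by_text_length(items, max_chars=MAX_CHARS):
    """Keep only (text, wav_path) pairs whose text fits within max_chars."""
    kept = [item for item in items if len(item[0]) <= max_chars]
    return kept, len(items) - len(kept)


# Hypothetical metadata entries.
items = [
    ("A short sentence.", "wavs/0001.wav"),
    ("x" * 120, "wavs/0002.wav"),
    ("Another utterance that fits.", "wavs/0003.wav"),
]

kept, dropped = filter_by_text_length(items)
print(f"kept {len(kept)} utterances, dropped {dropped}")
```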

Fictitious voices are a go. I tried to train a dual-speaker TTS, and concatenating the speaker embeddings of both voices gives a mixed voice that doesn’t sound robotic. It resembles the voice of the speaker who dominates the dataset, but is not the same one.

No experiments on GST, either. Trying to implement it on Taco2.

In general, the trend I observe is that data and its quality are probably the most critical thing. If anyone has anything to add, please do. Cheers!

PS: Mozilla TTS is the best implementation of Taco2 out there.

Hey everyone,
I wanted some help understanding the scope of this work. I initially stumbled upon this while looking at SV2TTS (https://arxiv.org/pdf/1806.04558.pdf) implementations, and it felt like there was scope for injecting embeddings, building on the work being undertaken here to generalize to multiple speakers.

However, it appears that the embeddings in the current workflow are learned by the system based on initial training conditions and not provided by a separate encoder network, which would mean that using the system as is for applications such as voice cloning is not possible.

Am I correct in this understanding or am I missing something?