Transfer Learning / Model Refinement to change LJSpeech

Greetings folks!
I am new to AI, learning things online, and trying to fiddle around with what's available on GitHub and Colab. I'm also doing the fast.ai free online course to get a better understanding.

The above preamble was meant to explain why my question below may seem noobish.

I am interested in eventually taking a pre-trained model and refining it to learn a different voice, which would be used for TTS. It's a purely personal project, so I don't have any quality standards to adhere to.

My question is: how does one go about achieving both of the following through model refinement (assuming I'm using a pre-trained LJSpeech model and have adequate speech samples of the other voices)?

  1. Changing the voice to, using someone famous as an example, Dennis Quaid
    and simultaneously
  2. Changing the delivery to, again using someone famous as an example, Morgan Freeman.

Therefore, the output would be a refined TTS voice that sounds like Dennis but speaks with the cadence and delivery of Morgan.

Is it actually possible to achieve that through transfer learning?

Or, is this only possible to do by training a model from scratch on Morgan and then transfer learn to Dennis’ voice?

thanks!!

Yes, it works; I have done it. The dataset has to be relatively high quality and its duration should be approximately 6 hours; after that, it took me about 8 hours of training to fit the new voice.

Hi and thanks for the feedback George. That is very encouraging.

I think I should be able to put together 6 hours of audio for each of the two voices.

Can you also help me understand the sequence you applied to get good results? Is it the following:
Step 1: Identify the training base: LJSpeech (I would use the Tacotron 260K model as the starting point)
Step 2: Train/refine the model against new voice 1 for about 8 hours to adapt prosody
Step 3: Train/refine against new voice 2 for about 8 hours to get the new voice/pitch

Also, could you clarify how many steps you were able to complete in those 8 hours? That would give me an idea of what to plan for. I am still learning, but I understood that training times might differ between a single-GPU (or low batch size) setup and multiple GPUs, or even another single GPU with more RAM or a higher clock speed.

Right, so the model I used was the one trained with forward attention and batch normalization. You get the model and check out the correct branch (usually it's dev). You don't really need to pay attention to anything besides making sure the dataset is as high quality as possible. Mine was an internal, domain-specific one, so it was pretty high quality: approximately 6 hours, male speaker, though the transcriptions were not 100% correct. It took me approximately 8 hours until I got good results, but I don't know if it could've been better. The model was able to adapt both prosody and accent (which was mostly South American), together with the voice.

So you would need to get the model and check out the correct branch, along with the configuration, and make sure you can continue training on the same model. Chances are you will get some errors because of the config file; in that case, I am happy to help debug.
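In case it helps with planning, the config edits mentioned above can be scripted. This is a hedged sketch only: the field names below (`run_name`, `datasets`, `lr`) are assumptions modeled on common TTS config layouts, not the real schema, so check them against the config.json shipped with the branch you check out.

```python
import json

# Hypothetical sketch: adapt a copy of the pretrained LJSpeech config for a
# new single-speaker dataset before resuming training from the checkpoint.
# All key names here are assumptions, not the actual config schema.
pretrained_cfg = {
    "run_name": "ljspeech-baseline",
    "audio": {"sample_rate": 22050},
    "datasets": [{"name": "ljspeech", "path": "LJSpeech-1.1/"}],
    "lr": 1e-3,
}

def adapt_for_finetune(cfg, dataset_path, run_name):
    """Return a copy of cfg pointed at the new voice's dataset,
    with a reduced learning rate (a common fine-tuning heuristic)."""
    new_cfg = json.loads(json.dumps(cfg))  # deep copy via JSON round-trip
    new_cfg["run_name"] = run_name
    new_cfg["datasets"] = [{"name": "new-voice", "path": dataset_path}]
    new_cfg["lr"] = cfg["lr"] * 0.1
    return new_cfg

finetune_cfg = adapt_for_finetune(pretrained_cfg, "voice1_dataset/", "voice1-finetune")
```

Note that audio settings such as the sample rate generally have to match the checkpoint you are resuming from, which is why they are copied through unchanged.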

I haven't worked with style modeling yet (prosody, accent, etc.), but I can tell you with relative certainty that this approach cannot get you a novel speaker, that is, for example, one speaker's voice with another's accent. That is probably because of the model's single-speaker limitation: if you fine-tune the model twice, you just end up with the second voice's attributes. You can change the prosody either by using external embeddings, or by using global style tokens and providing a wav file as reference. Combining different voice attributes would probably take a multi-speaker model with embedding support, which is what I am working on right now. I tried using embeddings on the single-speaker model, but I only got as far as changing the prosody. If you're looking to do voice adaptation (source → target), you definitely need a multi-speaker model and embeddings.
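To illustrate why a multi-speaker model with embeddings can mix attributes that a single-speaker model cannot, here is a toy sketch. It is plain Python, not a real TTS model, and every name and number in it is made up: the idea is simply that the decoder input concatenates text features with a learned per-speaker vector, so swapping that vector swaps voice identity while the text/prosody path stays fixed. A single-speaker model has no such slot to swap.

```python
# Toy illustration (not a real TTS architecture): conditioning a decoder
# on a per-speaker embedding vector.

def decoder_input(text_features, speaker_embedding):
    """Concatenate each step's text features with a fixed speaker vector."""
    return [step + speaker_embedding for step in text_features]

# Hypothetical learned speaker embeddings (made-up 2-D values).
speakers = {"dennis": [0.9, 0.1], "morgan": [0.2, 0.8]}

# Stand-in for encoder outputs over a sentence (two time steps).
text_features = [[0.5, 0.5], [0.3, 0.7]]

as_dennis = decoder_input(text_features, speakers["dennis"])
as_morgan = decoder_input(text_features, speakers["morgan"])
# Same text path, different identity vector, so the decoder sees a
# different conditioning signal and produces a different voice.
```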

In general, more GPUs do not imply faster convergence, and usually one GPU is enough.

Wow, thank you for the detailed explanation and for sharing your experience.

Nice! So, 8 hours + a high quality dataset and you were able to change both prosody and voice to a new speaker. That is very interesting.

I understand what you are saying about a different accent. I had thought of keeping the same accent for both voices (probably US), so I'm probably okay there. I was not aware of the concept of external embeddings. I had seen the term "speaker embedding" in a few posts on the forum, but did not make the connection; thank you for bringing it up, I will read about it. So basically, with this properly baked in, one could potentially change the accent from US to US-Texan, or from US to UK-Scottish, without training from scratch on a huge dataset of samples in the target accent.

My main takeaway is that I should work on my training samples before embarking upon any training, to make sure they do not become a source of failure / wasted training time.

Thanks!

Yes, the voice will change entirely to the new one :slight_smile: It will be like training a new TTS model; it just converges faster on the new voice because of the previous steps on LJSpeech.
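The "converges faster because of the previous steps" point can be illustrated with a toy example that has nothing TTS-specific in it: plain gradient descent on a 1-D least-squares objective, started either from a "pretrained" parameter near the optimum (warm start) or from a cold init. The numbers are arbitrary; only the inequality at the end is the point.

```python
# Toy illustration of why fine-tuning converges faster than training
# from scratch: count gradient-descent steps to reach the optimum of
# (w - target)^2 from two different starting points.

def steps_to_converge(w, target, lr=0.1, tol=1e-3):
    """Run gradient descent until |w - target| <= tol; return step count."""
    steps = 0
    while abs(w - target) > tol:
        grad = 2 * (w - target)  # derivative of (w - target)^2
        w -= lr * grad
        steps += 1
    return steps

scratch = steps_to_converge(0.0, 5.0)  # cold init, far from the optimum
warm = steps_to_converge(4.5, 5.0)     # "pretrained" init near the optimum
# The warm start reaches the same tolerance in fewer steps.
```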

If you need help just give a man a shout. Best of luck!