Requesting Guidance for Training

Hi there,

I'm looking for some guidance from the community. I hope someone can help.

My background: non-programmer, but able to read up, copy-paste, and modify enough VBA, JS (and now Python) to jury-rig things that automate tasks and make work less tedious.
My interest in TTS: I wanted to understand what AI models are all about. I have a very long-standing interest in computer speech recognition/generation, so I took it as my path to learn about AI models and complete a personal project at the same time.
Specific interest: I would like to make a model that takes a voice I love listening to and uses it to generate high-quality canned sentences. Quality is preferred over real-time generation.
Resources: not much. I found out about free Colab, so I am using that. It seems pretty sweet (my laptop has no GPU).

Where I am:

  • Used some of the excellent resources and replies to previous posts by u/erogol, u/sanjaesc and u/edresson1
  • Modified the samples you have provided and made my own Colab notebook
  • Branch cloned: dev
  • TTS model = Tacotron 2 [72a6ac5]
  • Vocoder = ParallelWaveGAN [72a6ac5]
  • Speech set used for testing that the notebook works = LJSpeech
  • The notebook can train both the vocoder and the TTS model
  • The notebook can generate sentences of my choice after loading either the base model or a model I have trained

What I want to do next:

  • Train the model on that voice I like. It will take time and manual effort, but I should be able to put together a couple of hours of clean, low-noise audio, which I will then split, transcribe, and trim the leading/trailing silences from (rough trimming sketch below).
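To make that prep step concrete, here is the kind of trimming pass I have in mind. The folder names and the top_db threshold are my own placeholders, not anything from the repo:

```python
# Sketch of the silence-trimming step for my dataset prep.
# wavs_raw/, wavs_trimmed/ and top_db=30 are placeholder choices I will tune.
from pathlib import Path

import librosa
import soundfile as sf

IN_DIR = Path("wavs_raw")       # split + transcribed clips (hypothetical layout)
OUT_DIR = Path("wavs_trimmed")
OUT_DIR.mkdir(exist_ok=True)

for wav_path in sorted(IN_DIR.glob("*.wav")):
    y, sr = librosa.load(wav_path, sr=22050)        # LJSpeech-style sample rate
    y_trim, _ = librosa.effects.trim(y, top_db=30)  # drop leading/trailing silence
    sf.write(OUT_DIR / wav_path.name, y_trim, sr)
```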

And finally, my queries:

  1. About the training sequence for the TTS model and the vocoder: there are a couple of recipes on GitHub, but since I will probably be running one epoch at a time given Colab's limitations, I would like to understand the flow. Is my understanding, as stated below, correct? (Command sketch after the sub-bullets.)
  • Compute statistics: run this command once before the first epoch with my personal WAV dataset, and re-run it only if I modify the WAV dataset used for training.
  • Train one epoch of the TTS model, then train one epoch of the vocoder. Rinse and repeat, updating the configs to refer to the latest version of each.
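In command form, the per-session loop I am imagining looks like this. I run these from Colab cells; the script names are what I see under TTS/bin in my dev-branch clone, so please correct me if I have them wrong:

```bash
# Once, before the first epoch (re-run only if the wav dataset changes):
python TTS/bin/compute_statistics.py --config_path tts_config.json \
    --out_path scale_stats.npy

# Then, each Colab session (one epoch each, epoch count set in the configs):
python TTS/bin/train_tts.py --config_path tts_config.json \
    --restore_path <latest_tts_checkpoint>.pth.tar
python TTS/bin/train_vocoder.py --config_path vocoder_config.json \
    --restore_path <latest_vocoder_checkpoint>.pth.tar
```
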
  2. I don't want to train a model from scratch :frowning: , I don't have the resources. Is Mozilla TTS set up for transfer learning? I know there is a restore option, and I have tested that I can get it to run (invocation below). Will restoring the base model from 72a6ac5 and training it on my set of WAVs achieve transfer learning? And should I first train the vocoder on the ground-truth WAVs from my dataset for a few epochs, and only afterwards train the TTS model on my WAV dataset? Or should I train both sequentially, one epoch each?
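For reference, this is the restore invocation I tested, with the released 72a6ac5 checkpoint saved under a local path of my own choosing:

```bash
# Transfer-learning attempt: start from the released checkpoint and train on
# my own wavs as configured in my_tts_config.json (paths are my local layout).
python TTS/bin/train_tts.py \
    --config_path my_tts_config.json \
    --restore_path pretrained/tacotron2_72a6ac5/checkpoint.pth.tar
```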

  3. Should I be using PWGAN + Tacotron 2 (single speaker) if I am interested in transfer learning? I know there is a WaveGrad + Tacotron 2 + multi-speaker model, but I couldn't understand the role of the speaker encoder (or how to train it), so I gave up and focused on PWGAN for my first experiment.

  4. Given Colab's limitations, I will probably have to train a couple of epochs at a time, save, and come back to it. How do I continue transfer learning beyond the first couple of epochs, which used 'restore'? Do I keep using 'restore' the next day for the next couple of epochs, or should I switch to 'continue'? (My guess at the difference is sketched below.)
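To make the question concrete, here is my guess at the difference. That --continue_path picks the config and optimizer state up from the previous run folder is only my reading of the argparse help, so correct me if I have it backwards:

```bash
# Day 1: seed training from the released checkpoint.
python TTS/bin/train_tts.py --config_path my_tts_config.json \
    --restore_path pretrained/tacotron2_72a6ac5/checkpoint.pth.tar

# Day 2 onwards: resume my own run folder instead?
python TTS/bin/train_tts.py --continue_path <my_previous_run_output_folder>
```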

Thank you so much for reading!