Multispeaker versus transfer learning

Inspired by @mrthorstenm I decided to create my own dataset as well. Starting from my own desk with a crappy headset microphone, I soon moved on to more professional methods.

In the end I hired two male voice talents; each will provide me with 20-25 hours of Belgian Dutch voice data over the course of the coming two months. My aim is to create other voices from this data as well, hopefully with a minimum of additional data. I asked them to record in mono WAV format, 44.1 kHz and 16-bit audio.
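When collecting recordings from multiple talents it's easy for one session to slip to a different sample rate or bit depth, which can quietly hurt training. As a minimal sketch (file names are illustrative), Python's standard `wave` module is enough to verify each clip against the mono / 44.1 kHz / 16-bit target:

```python
import wave

def meets_spec(path, rate=44100, channels=1, sample_width=2):
    """Return True if the WAV file matches the target recording spec."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sample_width)  # 2 bytes per sample = 16-bit

# Demo: write one second of silence in the target format, then verify it.
with wave.open("clip.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)

print(meets_spec("clip.wav"))  # True
```

Running this over the whole dataset before training makes it easy to catch any stray stereo or 48 kHz files early.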

Should I train two separate Tacotron 2 models, check which one is more suitable, and use transfer learning from it? Or is the current state of multi-speaker training good enough, and easier to work with, for generating future voices?

Are there any other tips or suggestions which I should think about?

Any help or input is appreciated.


Try all three? One Tacotron 2 model for each individual speaker, plus a multispeaker model. Then use whichever turns out best for pronunciation as the base for transfer learning (or continued training, as needed).
