Is it realistic to train TTS on 2,400 samples totaling 3 hours?

CSharpRU · January 11, 2021, 7:02pm

The title says it all: I can’t get more samples, and that’s kind of a pain.

erogol · January 13, 2021, 9:59am

You can try to fine-tune one of the pretrained models, but in general that amount of data is not enough for solid results.

CSharpRU · January 15, 2021, 1:55am

Thanks. So even for fine-tuning, the sample count isn’t enough?

nmstoker · January 16, 2021, 2:42am

It would be great if you gave it a go and reported back here. That’ll help everyone while also empowering you to answer your own question.

pathnirvana · January 26, 2021, 10:37am

Were you able to get an answer on this? I am also trying to train on an Indic dataset with just 3 hours of recordings, and training with nvidia/tacotron2 results in overfitting.

In general, what types of algorithms are more suitable for small Indic datasets? In most Indic languages the letter-to-phoneme mapping is fixed, i.e. a letter always maps to the same phoneme regardless of the word it appears in.
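To make the fixed letter-to-phoneme point concrete, here is a minimal sketch of a rule-based converter that could feed phonemes instead of raw characters to a TTS model. The table `TOY_G2P` and the function `to_phonemes` are hypothetical names, and the romanized entries are only illustrative; a real table would have to be written for the actual script and language.

```python
# Hypothetical toy table: romanized letters -> IPA-like phoneme symbols.
# The entries are illustrative only, not a real Indic orthography.
TOY_G2P = {
    "kh": "kʰ", "th": "tʰ", "aa": "aː",
    "k": "k", "t": "t", "m": "m", "n": "n",
    "a": "a", "i": "i", "u": "u",
}


def to_phonemes(text, table=TOY_G2P):
    """Convert text to phonemes by greedy longest-match table lookup."""
    max_len = max(len(key) for key in table)
    phonemes = []
    i = 0
    while i < len(text):
        # Try the longest possible letter sequence first, then shorter ones.
        for size in range(max_len, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                phonemes.append(table[chunk])
                i += len(chunk)
                break
        else:
            # No table entry matched at this position.
            raise ValueError(f"no phoneme rule for {text[i]!r} at index {i}")
    return phonemes


print(to_phonemes("khaata"))  # ['kʰ', 'aː', 't', 'a']
```

Because the lookup never depends on context, the same letter sequence always yields the same phonemes, which is the property described above. Whether phoneme input actually reduces overfitting on a 3-hour dataset would still need to be tested.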

rdh · January 28, 2021, 12:29pm

Hi, I can share my experience.

I used ~4000 audio fragments with transfer learning, and these are the results: https://drive.google.com/drive/u/0/folders/1OBbt9tgMKPJ6JSaVaqp-NV2EbwdLU2Fv

The original model (trained on 15,000 fragments) generated the following fragments:
https://drive.google.com/drive/folders/18qXOxd7Mj3pvsRbM2O2b__wInhVRzyHO?usp=sharing

I must say, the second dataset of 4000 files was of very low quality, and I didn’t have enough knowledge to tune the configuration file properly.