Is it realistic to train TTS on 2400 samples (3 hours of audio)?

The title says it all. I can’t get more samples, and that’s kind of a pain :(
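
For a sense of scale, here is a quick back-of-the-envelope check. It assumes the 3 hours is the total duration of all 2400 samples, which is how the title reads:

```python
# Assumption: 3 hours is the combined length of all 2400 clips.
total_hours = 3
num_samples = 2400

# Convert hours to seconds, then divide by the number of clips.
avg_seconds = total_hours * 3600 / num_samples
print(f"average clip length: {avg_seconds:.1f} s")
```

So the clips average about 4.5 seconds each under that assumption.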

You can try fine-tuning one of the pretrained models, but in general that much data is not enough for solid results.

Thanks. So the sample count isn’t enough even for fine-tuning?

It would be great if you gave it a go and reported back here. That would help everyone, and it would let you answer your own question.

Were you able to get an answer on this? I am also trying to train on an Indic dataset with just 3 hours of recordings, and training with nvidia/tacotron2 results in overfitting.
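
One common way to limit the damage from overfitting on a small dataset is to hold out a validation set and keep the checkpoint with the lowest validation loss. Below is a minimal sketch of that early-stopping rule in plain Python. The function name `best_checkpoint`, the `patience` value, and the loss numbers are all made up for illustration; none of it comes from nvidia/tacotron2.

```python
def best_checkpoint(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has not improved for `patience` epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    # Never ran out of patience: stop at the last epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: they fall, then rise as the model overfits.
losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.8]
best, stopped = best_checkpoint(losses, patience=3)
print(f"keep checkpoint from epoch {best}, stop at epoch {stopped}")
```

With these made-up numbers the rule keeps the epoch 2 checkpoint and stops at epoch 5.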

In general, what types of algorithms are more suitable for small Indic datasets? The letter-to-phoneme mapping is fixed in most Indic languages, i.e. a letter always maps to the same phoneme regardless of the word it appears in.
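
If the mapping really is context-free, as described above, the text front end can be a plain lookup table, so the model could be fed phonemes instead of raw characters. Here is a minimal sketch of that idea. The function `letters_to_phonemes` and the table `TOY_TABLE` are hypothetical, and the table uses placeholder Latin symbols, not a real Indic phoneme inventory.

```python
def letters_to_phonemes(text, table):
    """Map each grapheme to its phoneme with a context-free lookup."""
    # Try longer graphemes first so multi-character units win over single letters.
    graphemes = sorted(table, key=len, reverse=True)
    phonemes = []
    i = 0
    while i < len(text):
        for g in graphemes:
            if text.startswith(g, i):
                phonemes.append(table[g])
                i += len(g)
                break
        else:
            # Unknown symbol (space, punctuation): pass it through unchanged.
            phonemes.append(text[i])
            i += 1
    return phonemes


# Hypothetical toy table for illustration only.
TOY_TABLE = {"k": "K", "kh": "KH", "a": "AH", "m": "M"}
print(letters_to_phonemes("khama", TOY_TABLE))
```

The same grapheme yields the same phoneme wherever it occurs, which is exactly the property described for these languages. A real table would have to be built per language and script.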

Hi, I can share my experience.

I used ~4000 audio fragments with transfer learning, and these are the results: https://drive.google.com/drive/u/0/folders/1OBbt9tgMKPJ6JSaVaqp-NV2EbwdLU2Fv

The original model (trained on 15,000 fragments) generated the following fragments:
https://drive.google.com/drive/folders/18qXOxd7Mj3pvsRbM2O2b__wInhVRzyHO?usp=sharing

I must say, the second dataset of 4000 files was of very low quality, and I didn’t have adequate knowledge to properly fine-tune the configuration file.