I want to try training with a large batch of custom audio samples, all of which are pulled automatically from very long videos. So far I have split them by sentence, but that leaves many very short samples (a lot of “Yeah.”, “Okay.”, and so on), as well as some long run-on sentences that can run to 20 s or more.
Is there a recommended sample length? If standardizing the length is important, I could split the longer samples and merge the shorter ones into their neighbours.
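To make the merging idea concrete, here is a minimal sketch of one way it could be done. It assumes each sample is a `(start, end, text)` tuple in seconds, sorted by start time. The names `merge_short_segments` and `too_long` and the threshold values (`min_len`, `max_len`, `max_gap`) are placeholders for illustration, not recommended settings. Long samples are only flagged, since splitting them cleanly would need word-level timestamps that may not be available.

```python
from typing import List, Tuple

# (start_sec, end_sec, transcript); a hypothetical layout for this sketch
Segment = Tuple[float, float, str]


def merge_short_segments(
    segments: List[Segment],
    min_len: float = 2.0,   # assumed: anything shorter counts as "too short"
    max_len: float = 10.0,  # assumed: never build a merged sample longer than this
    max_gap: float = 0.5,   # assumed: only merge neighbours separated by a short pause
) -> List[Segment]:
    """Merge short segments into the preceding neighbour when they are close in time."""
    merged: List[Segment] = []
    for start, end, text in segments:
        if merged:
            p_start, p_end, p_text = merged[-1]
            either_short = (p_end - p_start) < min_len or (end - start) < min_len
            close_enough = (start - p_end) <= max_gap
            fits = (end - p_start) <= max_len
            if either_short and close_enough and fits:
                # Extend the previous segment to cover this one and join the text.
                merged[-1] = (p_start, end, p_text + " " + text)
                continue
        merged.append((start, end, text))
    return merged


def too_long(segments: List[Segment], max_len: float = 10.0) -> List[Segment]:
    """Return the segments that still exceed max_len and need manual or aligned splitting."""
    return [s for s in segments if (s[1] - s[0]) > max_len]
```

Short samples with no nearby neighbour (for example an isolated “Okay.”) are left as they are, so they could still be dropped afterwards if they turn out to hurt training.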
I was also getting warnings about the amplitude being clipped. Is this important? If so, I can try to make sure the samples are equalized in volume, but if not I may just leave them as they are.
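For the volume question, a rough sketch of what checking and equalizing could look like. It assumes the audio has already been loaded as floating-point samples in the nominal range -1.0 to 1.0. The function names and the `target_peak` value are assumptions for illustration, not settings taken from any particular TTS toolkit. Note that peak normalization only rescales the signal; it cannot restore a recording that was already clipped at the source.

```python
from typing import List, Sequence


def clipped_fraction(samples: Sequence[float], limit: float = 1.0) -> float:
    """Fraction of samples whose magnitude is at or beyond full scale."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def peak_normalize(samples: Sequence[float], target_peak: float = 0.9) -> List[float]:
    """Scale the samples so the loudest one has magnitude target_peak (assumed value)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty clip: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Running `clipped_fraction` over each file first would show whether the warnings affect a handful of samples or most of the dataset.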
Also, is there a general minimum number of samples? I was thinking that 10k might be enough once I get there, since transcribing the audio costs time and money.
I was going to use Tacotron, since it seems to be the default, but maybe something else is better suited to this?