Hi everyone. I have two questions about TTS datasets/corpora:
When generating audio for a particular text, does the model handle a high-frequency word (one that appears many times in the corpus) better than a low-frequency one?
If 'curious' is a high-frequency word in my corpus, how can the model generalize that word's audio at synthesis time, given that the word was most likely recorded with various different intonations?
(Frequency = how many times a certain word occurs in our corpus.)
Yes, in general the model works better for words that are well represented in the dataset, especially for languages whose pronunciation differs from their spelling. Therefore, it is better to create your dataset with good coverage of words and phonemes, and to use phoneme inputs for such languages.
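A quick way to check coverage is to count word frequencies and see how many words are singletons (occur only once), since those are the ones the model will struggle to generalize. This is a minimal sketch, not from the thread; `coverage_report` and the toy corpus are my own illustration:

```python
from collections import Counter

def coverage_report(lines):
    """Count word frequencies across a corpus and report the vocabulary
    size plus how many words occur only once (poorly covered)."""
    words = Counter()
    for line in lines:
        words.update(line.lower().split())
    singletons = sum(1 for count in words.values() if count == 1)
    return len(words), singletons

# Hypothetical corpus lines (in practice, read them from your metadata file)
corpus = [
    "the curious cat sat",
    "a curious mind wonders",
    "the cat sat quietly",
]
vocab_size, singletons = coverage_report(corpus)
print(vocab_size, singletons)  # 8 words total, 4 seen only once
```

The same idea extends to phonemes: run your text through a phonemizer first and count phoneme n-grams instead of words to check phoneme coverage.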
How important is it to have a Gaussian-like distribution on clip and sentence lengths in the training corpus?
If I have a corpus with a distribution of text lengths as in the image below, what would you recommend doing? The histogram shows the number of characters per line of text:
As you can see, the dataset has a lot of short sentences (one, two, three, and four words long).
I'm having an alignment issue during training and inference, in which the model produces only the first few words of a longer sentence. Could the length distribution be related to this?
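To inspect and rebalance the distribution before training, you can bucket lines by character count and drop the very short utterances. A minimal sketch (the helper names, bin size, and cutoff are my own assumptions, not from the thread):

```python
from collections import Counter

def length_histogram(lines, bin_size=20):
    """Bucket lines by character count to inspect the length distribution."""
    hist = Counter(len(line) // bin_size * bin_size for line in lines)
    return dict(sorted(hist.items()))

def drop_short(lines, min_chars=15):
    """Filter out very short utterances that dominate the corpus."""
    return [line for line in lines if len(line) >= min_chars]

# Hypothetical sample lines for illustration
corpus = [
    "hi",
    "short one",
    "a medium length sentence here",
    "this is a considerably longer training sentence for the model",
]
print(length_histogram(corpus))  # two lines in the 0-19 char bin
print(drop_short(corpus))        # keeps only the two longer lines
```

A reasonable `min_chars` depends on your data; the point is to trim the spike of one-to-four-word lines rather than to enforce a strictly Gaussian shape.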