Corpus: Word Count & Audio

Hi everyone. I have two questions about TTS datasets/corpora.

My questions are:

  1. Is a high-frequency word (one that occurs many times in the corpus) better than a low-frequency word when it comes to generating audio for a particular text?

  2. If the word ‘curious’ is a high-frequency word in my corpus, how does the model generalize that word when it comes to TTS?
    I ask because the word was most likely recorded with several different intonations.

Frequency = how many times a certain word occurs in the corpus

Yes, in general the model works better for words that are well represented in the dataset, especially for languages whose pronunciation differs from their spelling. Therefore, it is better to build your dataset with good coverage of words and phonemes, and for such languages it is better to use phonemes as the model input.
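
To check word coverage in practice, something like the following could be a starting point. It is only a rough sketch: the `word_frequencies` and `rare_words` helpers and the sample transcripts are made up for illustration, and the regex assumes plain ASCII English text, so it would need adapting for other languages or for phoneme-level counts.

```python
from collections import Counter
import re

def word_frequencies(lines):
    """Count how often each lowercased word occurs across transcript lines."""
    counts = Counter()
    for line in lines:
        # Assumption: words are runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

def rare_words(counts, min_count=2):
    """Return words seen fewer than min_count times, sorted alphabetically."""
    return sorted(w for w, c in counts.items() if c < min_count)

# Hypothetical transcripts standing in for the corpus metadata.
transcripts = [
    "I am curious about this",
    "She was curious too",
    "This is a test",
]
counts = word_frequencies(transcripts)
print(counts.most_common(3))   # most frequent words
print(rare_words(counts))      # words that may be under-represented
```

Words that end up in the rare list are candidates for adding more recordings, if the goal is better coverage.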


@erogol — related question:

How important is it to have a Gaussian-like distribution on clip and sentence lengths in the training corpus?

If I have a corpus with a distribution of text lengths as in the image below, what would you recommend doing? The histogram below shows the number of characters per line of text:

Remove some shorter utterances? Leave as-is?

As you can see, the dataset has a lot of short sentences (one, two, three, and four words long).
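
For anyone who wants to produce the same kind of summary for their own corpus, here is a minimal sketch. The `length_histogram` and `short_fraction` helpers, the bin width of 20 characters, and the sample lines are all assumptions for illustration, not the script used for the histogram above.

```python
from collections import Counter

def length_histogram(lines, bin_width=20):
    """Map the lower edge of each character-length bin to its line count."""
    hist = Counter()
    for line in lines:
        n = len(line.strip())
        hist[(n // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))

def short_fraction(lines, max_words=4):
    """Fraction of lines with at most max_words whitespace-separated words."""
    if not lines:
        return 0.0
    short = sum(1 for line in lines if len(line.split()) <= max_words)
    return short / len(lines)

# Hypothetical lines of text, one utterance per line.
sample = [
    "Yes",
    "No thanks",
    "This is a much longer sentence for the corpus",
]
print(length_histogram(sample))
print(short_fraction(sample))
```

A large `short_fraction` would match the situation described here, where one-to-four-word utterances dominate.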

I’m having an alignment issue during training and inference, in which the model produces only the first few words of a longer sentence… is this issue related?

-josh

I’d say the problem is related. Maybe you can sub-sample the dataset, biasing towards longer sentences, in the later stages of training.
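
One way this kind of length-biased sub-sampling could look is sketched below. This is not code from any TTS library: `subsample_longer`, the `alpha` exponent, and the character-length weighting are assumptions chosen for illustration. It draws items without replacement using the Efraimidis-Spirakis key trick (key = u^(1/weight), keep the largest keys).

```python
import random

def subsample_longer(texts, keep, alpha=1.0, seed=0):
    """Draw up to `keep` distinct items, favouring longer texts.

    Each item's weight is len(text) ** alpha, so alpha=0 gives uniform
    sampling and larger alpha biases more strongly towards long texts.
    """
    rng = random.Random(seed)
    keyed = []
    for text in texts:
        # Clamp the length to 1 so empty strings do not get a zero weight.
        weight = max(len(text), 1) ** alpha
        keyed.append((rng.random() ** (1.0 / weight), text))
    # Items with the largest keys form the weighted sample.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in keyed[:keep]]

# Hypothetical corpus: half very short, half long utterances.
corpus = ["hi"] * 500 + ["a much longer sentence " * 3] * 500
subset = subsample_longer(corpus, keep=200, alpha=1.0)
n_long = sum(1 for t in subset if len(t) > 10)
print(n_long, "of", len(subset), "sampled utterances are long")
```

Starting with `alpha=0` and raising it in later epochs would be one way to apply the bias only in the later stages of training.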


I’ll give it a try, thanks!