Hi everyone. I have two questions about TTS datasets/corpora:
When generating audio for a particular text, does the model handle a high-frequency word (one that appears many times in the corpus) better than a low-frequency one?
If 'curious' is a high-frequency word in my corpus, how can the model generalize that word's audio at synthesis time, given that the word was most likely recorded with various different intonations?
(Frequency = how many times a certain word occurs in our corpus.)
Yes, in general the model works better for words that are well represented in the dataset, especially for languages whose pronunciation differs from their spelling. Therefore, it is better to create your dataset with good coverage of words and phonemes, and to use phoneme inputs for such languages.
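A quick way to check coverage is to count word frequencies and see how many words are singletons (occur only once), since those are the ones the model will struggle to generalize. This is a minimal sketch, not from the thread; `coverage_report` and the toy corpus are my own illustration:

```python
from collections import Counter

def coverage_report(lines):
    """Count word frequencies across a corpus and report the vocabulary
    size plus how many words occur only once (poorly covered)."""
    words = Counter()
    for line in lines:
        words.update(line.lower().split())
    singletons = sum(1 for count in words.values() if count == 1)
    return len(words), singletons

# Hypothetical corpus lines (in practice, read them from your metadata file)
corpus = [
    "the curious cat sat",
    "a curious mind wonders",
    "the cat sat quietly",
]
vocab_size, singletons = coverage_report(corpus)
print(vocab_size, singletons)  # 8 words total, 4 seen only once
```

The same idea extends to phonemes: run your text through a phonemizer first and count phoneme n-grams instead of words to check phoneme coverage.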
How important is it to have a Gaussian-like distribution on clip and sentence lengths in the training corpus?
If I have a corpus with a distribution of text lengths as in the image below, what would you recommend doing? The histogram shows the number of characters per line of text:
As you can see, the dataset has a lot of short sentences (one, two, three, and four words long).
I'm having an alignment issue during training and inference, in which the model produces only the first few words of a longer sentence. Could the length distribution be related to this?
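To inspect and rebalance the distribution before training, you can bucket lines by character count and drop the very short utterances. A minimal sketch (the helper names, bin size, and cutoff are my own assumptions, not from the thread):

```python
from collections import Counter

def length_histogram(lines, bin_size=20):
    """Bucket lines by character count to inspect the length distribution."""
    hist = Counter(len(line) // bin_size * bin_size for line in lines)
    return dict(sorted(hist.items()))

def drop_short(lines, min_chars=15):
    """Filter out very short utterances that dominate the corpus."""
    return [line for line in lines if len(line) >= min_chars]

# Hypothetical sample lines for illustration
corpus = [
    "hi",
    "short one",
    "a medium length sentence here",
    "this is a considerably longer training sentence for the model",
]
print(length_histogram(corpus))  # two lines in the 0-19 char bin
print(drop_short(corpus))        # keeps only the two longer lines
```

A reasonable `min_chars` depends on your data; the point is to trim the spike of one-to-four-word lines rather than to enforce a strictly Gaussian shape.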