Hello everyone!
I am preparing a dataset for TTS. I want the model to pronounce questions with clearly question-like intonation. In English we can tell that we have just been asked something, at least from the special word order, while some other languages have no such restriction: the recipe is just to add “?” at the end, and in speech the intonation makes it clear that a question has been asked.
So, I want to have around 25k sentences of various lengths, ending sometimes with ‘.’, sometimes with ‘?’, sometimes with ‘!’.
Will the TTS learn to pronounce them right? How many ?-sentences do I need then?
Right now I have around 315, which seems too few to me. Should it be around 2k or so?
One more question.
Let’s assume the dataset is missing one word that is quite hard to pronounce, so the TTS makes mistakes on it. To fix that, clearly, the word should be added to the dataset, right? But roughly how many times should it occur?
To answer your question about special characters (like ? and ,) influencing the TTS to help with intonation: yes, it works exactly like that. You will get the expected prosody for questions and pauses for commas, as long as they are in the dataset and the audio in the dataset is consistent with them.

The other questions, about how many, are very hard to answer. You’d have to find what works empirically, or add enough that the proportion is large enough not to worry about it.
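
To see what proportion you actually have, you can count sentences by their final punctuation mark. This is only a minimal sketch: it assumes the transcripts are already loaded as a list of strings, and the function names and sample sentences are made up for illustration.

```python
from collections import Counter


def ending_counts(sentences):
    """Count sentences by their last character ('.', '?', '!' or 'other')."""
    counts = Counter()
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty lines
        last = sentence[-1]
        counts[last if last in ".?!" else "other"] += 1
    return counts


def ending_shares(sentences):
    """Return the share of each ending as a fraction of all non-empty sentences."""
    counts = ending_counts(sentences)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mark: n / total for mark, n in counts.items()}


# Hypothetical sample; replace with the transcripts from your own dataset.
sample = ["It is late.", "Is it late?", "Go home!", "Are you sure?"]
print(ending_counts(sample))
print(ending_shares(sample))
```

For scale, 315 questions out of 25k sentences is about 1.3% of the data, and 2k would be 8%. Whether either is enough is something you would still have to test.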


If you try fixing problems like this, you are more likely to add bias than to fix the problem. Like I said, find it empirically, or add a ton of data and hope it works.
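
If you go the empirical route, it helps to know how often a word already occurs in the transcripts before and after each change. A rough sketch, again assuming the transcripts are a list of strings; the helper names are hypothetical and the tokenization is deliberately simple (it splits on anything that is not a letter, digit, or underscore, so "don't" becomes "don" and "t").

```python
import re
from collections import Counter


def word_counts(sentences):
    """Count lowercase word occurrences across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"\w+", sentence.lower()))
    return counts


def rare_words(sentences, max_count=1):
    """Return words that occur at most max_count times, sorted alphabetically."""
    counts = word_counts(sentences)
    return sorted(word for word, n in counts.items() if n <= max_count)


# Hypothetical sample; replace with the transcripts from your own dataset.
sample = ["The quick fox.", "Is the fox quick?", "Otorhinolaryngology is hard!"]
counts = word_counts(sample)
print(counts["fox"])                    # occurrences of a word that is present
print(counts["worcestershire"])         # a missing word simply counts as 0
print(rare_words(sample, max_count=1))
```

The counts only tell you what the model has seen. How many occurrences are enough for a given word still has to be found by training and listening.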

