Advanced TTS Techniques

Hello. First of all I would like to thank you all for your efforts. The demo voices
really sound great!

For my project, a game with many AI characters, I am looking for suggestions on how
the following might be achieved:

1- TTS for a lot of different voices: male, female, young, adolescent, adult, old, sick, fantasy & sci-fi (monsters, aliens).
Could I use a few base voices and do some kind of morphing on the speaker embedding?
Or could the speaker embedding be randomly generated based on some range?

2- Control intonation and emotion.

3- Introduce foreign accents and speech defects or just variations.
By variations I mean the length, pitch, or emphasis of some syllables, sometimes only on a specific word.

4- Other sounds that humans sometimes make: sigh, sneeze, breathe, breathe while talking, clear throat, cough, burp, argh!, hum, hyperventilate.

5- More complex sounds like crying, laughing, humming, whistling, singing.

6- Could animal ‘voices’ be simulated too? (bird chirp/sing, cat, dog).

Can anyone point me in the right direction on one or more of the points above?

Or perhaps direct me to another link/page/forum more related to my questions?

Sounds like you want to replace voice and Foley artists with a tool that does all this for you. For now, you’d be better served by hiring those folks if you want high-quality audio.

For #1, you can look up (here, Google, GitHub, arXiv) transfer learning and multispeaker Tacotron implementations; that might get you close. For #2, there are some emotional TTS variants; again, you’d need to look up what’s involved. For #3-6, no idea.
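
To make the "morphing" idea from #1 a bit more concrete, here is a rough sketch of what blending speaker embeddings could look like. It assumes a multispeaker model that is conditioned on a fixed-length speaker embedding vector. The function names `lerp` and `random_voice` and the tiny 3-dimensional vectors are made up for illustration; real embeddings would come from the model or a speaker encoder and be much longer. Whether a given model produces a natural-sounding voice from a blended embedding is not guaranteed, so treat this as something to experiment with.

```python
import random


def lerp(a, b, t):
    """Linearly blend two speaker embeddings: t=0 returns a, t=1 returns b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def random_voice(bases, rng):
    """Sample a random convex combination of the base embeddings.

    The result stays inside the region spanned by the base voices,
    which is one way to read "randomly generated based on some range".
    """
    # Normalised exponential draws give weights that are positive and sum to 1.
    weights = [rng.expovariate(1.0) for _ in bases]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(bases[0])
    return [sum(w * b[i] for w, b in zip(weights, bases)) for i in range(dim)]


if __name__ == "__main__":
    # Hypothetical toy embeddings, for illustration only.
    adult_male = [0.2, -0.5, 0.9]
    adult_female = [-0.4, 0.7, 0.1]
    old_male = [0.6, -0.1, -0.3]

    halfway = lerp(adult_male, adult_female, 0.5)
    npc_voice = random_voice([adult_male, adult_female, old_male], random.Random(42))
    print(halfway)
    print(npc_voice)
```

You would then pass the resulting vector to the synthesizer in place of a trained speaker's embedding. Seeding the random generator per character (for example from a character ID) keeps each NPC's voice stable between sessions.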