Are there any plans to support SSML? It would be great to be able to control pauses, emphasize specific parts of a sentence, and so on.
The GST model is supposed to encode the prosody style of the target audio, right? You have added the TacotronGST model, but how do I use it? How do I provide the target audio, and what are the requirements for it (e.g. must the speaker be the same, must the target audio be roughly the same length as the synthesized one, and so on)?
BTW, I still can’t get either the GST or the Tacotron2 model to learn (GitHub issue: https://github.com/mozilla/TTS/issues/287).