I implemented the method of predicting style tokens from text alone, as described in this paper. The method works, and the effect, while subtle, is more expressive speech. Here’s an example after fewer than 100K training steps. Sound samples: compare, for example, TestSentence_1.wav with TestSentence_GST_1.wav.
Each pair of test sentences is generated by the same Tacotron network. For the GST wav file, the style tokens were produced by a separate network that takes the Tacotron encoder output and predicts style tokens. The non-GST file was generated with the style tokens set to zero.
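To make the two conditions concrete, here is a minimal NumPy sketch of the idea: a small predictor pools the Tacotron encoder output, scores a bank of learned style-token embeddings, and returns their softmax-weighted sum as the style embedding; the non-GST baseline simply uses a zero vector instead. All dimensions, names, and the mean-pooling choice are my assumptions for illustration, not details from the post, and the weights are random stand-ins rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative, not from the post.
T, D = 50, 256   # encoder time steps, encoder output dim
K, S = 10, 256   # number of style tokens, style embedding dim

# Learned parameters (random stand-ins here, trained in practice).
style_tokens = rng.normal(size=(K, S))   # GST embedding table
W = rng.normal(size=(D, K)) * 0.01      # text-summary -> token-score projection

def predict_style(encoder_out: np.ndarray) -> np.ndarray:
    """Predict a style embedding from encoder output alone (no reference audio)."""
    pooled = encoder_out.mean(axis=0)        # (D,) summarize the text
    logits = pooled @ W                      # (K,) one score per style token
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ style_tokens            # (S,) weighted sum of tokens

encoder_out = rng.normal(size=(T, D))
style = predict_style(encoder_out)     # GST path: predicted style embedding
baseline = np.zeros_like(style)        # non-GST path: style set to zero
```

In the real system both paths would feed this vector into the Tacotron decoder; the sketch only shows where the two wav files in the comparison differ.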