My progress on expressive speech synthesis

I implemented the method of predicting style tokens from text alone, as described in this paper. The method works, and the effect, while subtle, is that of more expressive speech. Here’s an example after less than 100K steps. Sound samples. Compare, for example, TestSentence_1.wav with TestSentence_GST_1.wav.

Each pair of test sentences was generated by the same Tacotron network. For the GST wav file, the style tokens were produced by a separate network that takes the Tacotron encoder output and predicts style tokens. The non-GST file was generated with the style embedding set to zero.
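
For anyone curious, the prediction network is conceptually something like the sketch below. This is a simplified PyTorch illustration, not my actual code: the GRU summarizer, the layer sizes, and the token-bank size are all placeholders.

```python
import torch
import torch.nn as nn

class TextPredictedStyle(nn.Module):
    """Sketch: predict a style embedding from Tacotron encoder outputs alone,
    so no reference audio is needed at synthesis time. Dimensions are
    illustrative placeholders, not the values I actually trained with."""

    def __init__(self, encoder_dim=256, style_dim=128, num_tokens=10):
        super().__init__()
        # Summarize the variable-length encoder sequence into one vector.
        self.summarizer = nn.GRU(encoder_dim, 128, batch_first=True)
        # Predict weights over a fixed bank of learned style tokens.
        self.token_logits = nn.Linear(128, num_tokens)
        self.style_tokens = nn.Parameter(torch.randn(num_tokens, style_dim))

    def forward(self, encoder_outputs):
        # encoder_outputs: (batch, time, encoder_dim)
        _, hidden = self.summarizer(encoder_outputs)
        weights = torch.softmax(self.token_logits(hidden[-1]), dim=-1)
        # Weighted sum over the token bank -> one style embedding per utterance.
        return weights @ self.style_tokens  # (batch, style_dim)
```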

The authors of the GST papers seem to like treating everything as attention. In particular, the style tokens are used as “attention”: the style embedding is added to the encoder output. This means the decoder has to disentangle the style from the encoded text.

I tried simply concatenating the encoder output and the style-net output instead, and it works just fine. I can’t tell if it’s better, but I don’t see why one would ever incorporate style the “attention” way.
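
To make the difference concrete, here is a schematic sketch of the two ways of injecting the style embedding. Shapes and function names are illustrative only.

```python
import torch

def add_style_additive(encoder_out, style_emb):
    """GST-paper style: broadcast-add the style embedding to every encoder
    frame, so the decoder sees one mixed representation.
    Requires style_emb to have the same dimension as the encoder output."""
    # encoder_out: (batch, time, dim), style_emb: (batch, dim)
    return encoder_out + style_emb.unsqueeze(1)

def add_style_by_concat(encoder_out, style_emb):
    """The variant I tried: tile the style embedding across time and
    concatenate, keeping text and style in separate channels
    (the decoder input dimension grows by style_dim)."""
    batch, time, _ = encoder_out.shape
    tiled = style_emb.unsqueeze(1).expand(batch, time, style_emb.size(-1))
    return torch.cat([encoder_out, tiled], dim=-1)
```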

Sounds quite good. Is this your own dataset?

I use the M-AILABS dataset for training (the mary_ann reader). The recordings are good quality, and mary_ann is a dynamic reader with a nice voice.

There are a large number of LibriVox recordings by mary_ann available. I’m working on a script to align more of her recordings for use in training style-token prediction from text. However, it’s going slowly because of the difficulty of detecting all the corner cases where the forced aligner fails.
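
The kind of post-filtering I mean looks roughly like this. The segment format and thresholds are made up for illustration; they are not any particular aligner’s output.

```python
import csv

def filter_alignments(path, min_conf=0.7, max_chars_per_sec=30.0):
    """Hypothetical post-filter for forced-alignment output: drop segments
    with low confidence or an implausible speaking rate, which usually means
    the aligner drifted. Assumes a CSV with start, end, confidence and text
    columns; this is an illustration, not a standard format."""
    kept = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            start, end = float(row["start"]), float(row["end"])
            conf, text = float(row["confidence"]), row["text"]
            duration = end - start
            if duration <= 0 or conf < min_conf:
                continue
            if len(text) / duration > max_chars_per_sec:
                continue  # too many characters per second -> likely misaligned
            kept.append(row)
    return kept
```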

You can also try https://github.com/mozilla/DSAlign
