My progress on expressive speech synthesis

geneing · September 14, 2019, 8:16pm

I implemented the method of predicting style tokens from text alone, as described in this paper. The method works, and the effect, while subtle, is more expressive speech. Here’s an example after fewer than 100K steps: Sound samples. Compare, for example, TestSentence_1.wav with TestSentence_GST_1.wav.

The pairs of test sentences are generated by the same Tacotron network. For the GST wav file, the style tokens were generated by a separate network that takes the Tacotron encoder output and produces style tokens. The non-GST file was generated with the style token set to zero.
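To make the setup above concrete, here is a minimal sketch of a style predictor that maps encoder output to a style embedding, plus the all-zero style used for the non-GST samples. It is not the code used in this thread: the mean-pooling step, the single linear projection, and every name and dimension (`predict_style`, `zero_style`, `proj`, `tokens`, `T`, `D`, `K`, `S`) are assumptions chosen for illustration, and the random weights stand in for trained ones.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_style(encoder_out, proj, tokens):
    """Hypothetical text-only style predictor.

    encoder_out: T x D encoder frames, proj: K x D projection,
    tokens: K x S style token embeddings.
    Returns (token weights, S-dim style embedding).
    """
    num_frames = len(encoder_out)
    enc_dim = len(encoder_out[0])
    # Assumption: summarize the sentence by mean-pooling over time.
    pooled = [sum(f[d] for f in encoder_out) / num_frames for d in range(enc_dim)]
    # One logit per style token, normalized to combination weights.
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in proj]
    weights = softmax(logits)
    # Style embedding is the weighted sum of the token embeddings.
    style_dim = len(tokens[0])
    style = [
        sum(weights[k] * tokens[k][s] for k in range(len(tokens)))
        for s in range(style_dim)
    ]
    return weights, style


def zero_style(style_dim):
    """Baseline used for the non-GST samples: style set to zero."""
    return [0.0] * style_dim


rng = random.Random(0)
T, D, K, S = 5, 4, 3, 2  # frames, encoder dim, token count, style dim
encoder_out = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
proj = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(K)]
tokens = [[rng.uniform(-1, 1) for _ in range(S)] for _ in range(K)]

weights, style = predict_style(encoder_out, proj, tokens)
baseline = zero_style(S)
```

In a real system the pooling would more likely be a learned summary (for example a recurrent layer), and the predictor would be trained against the style weights or embeddings produced by the reference-audio GST module.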

geneing · September 17, 2019, 4:45am

Authors of GST papers seem to like treating everything as attention. In particular, style tokens are used as “attention” (i.e. the style embedding is added to the encoder output). This means that the decoder has to disentangle style and encoded text.

I tried simply concatenating the encoder output and the style net output, and it works just fine. Can’t tell if it’s better, but I don’t see why one would ever incorporate style the “attention” way.
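The difference between the two ways of injecting style can be sketched in a few lines. This is an illustration only, using plain Python lists instead of tensors; `add_style` and `concat_style` are hypothetical helper names, not functions from any TTS codebase.

```python
def add_style(encoder_out, style):
    """Additive conditioning: style is summed into every encoder frame.

    Requires the style vector to have the same size as an encoder frame,
    and mixes style and text in the same dimensions.
    """
    if len(style) != len(encoder_out[0]):
        raise ValueError("additive style needs style_dim == encoder_dim")
    return [[x + s for x, s in zip(frame, style)] for frame in encoder_out]


def concat_style(encoder_out, style):
    """Concatenative conditioning: style is appended to every encoder frame.

    Text and style stay in separate dimensions, and the style vector
    can have any size.
    """
    return [list(frame) + list(style) for frame in encoder_out]


encoder_out = [[1.0, 2.0], [3.0, 4.0]]  # 2 frames, encoder_dim = 2

added = add_style(encoder_out, [0.5, -0.5])
# added == [[1.5, 1.5], [3.5, 3.5]]

concatenated = concat_style(encoder_out, [0.5, -0.5, 0.25])
# each frame now has 2 + 3 = 5 values
```

With concatenation the decoder input width grows by the style dimension, so the layers consuming the encoder output have to be sized accordingly.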

erogol · September 18, 2019, 2:22pm

Sounds quite good. Is this your own dataset?

geneing · September 18, 2019, 6:16pm

I use the M-AILABS dataset for training (mary_ann reader). The recording quality is good, and mary_ann is a dynamic reader with a nice voice.

There is a large number of LibriVox recordings by mary_ann available. I’m working on a script to align more of her recordings for use in training style token prediction from text. However, it’s going slowly because it is difficult to detect all the corner cases where the forced aligner fails.

erogol · September 18, 2019, 7:56pm

You can also try https://github.com/mozilla/DSAlign
