How do we scale tokens during inference time for silence prolongation?

georroussos · July 16, 2020, 2:51pm

I am training with GST and it works very well. It makes the speech much more expressive, and the model learns a few different styles. However, I would like to make the pauses between sentences longer. Even though the pauses at periods in my dataset all have exactly the same duration, I still haven't figured out how to do it, as mentioned here: https://github.com/mozilla/TTS/issues/167. My idea was to multiply the token tensors by a value, but nothing really worked. Ideally, the style would stay the same and only the speech and the pauses would change. My problem is that I do not really know how to disentangle these.

sanjaesc · July 17, 2020, 9:47am

If you want to condition on tokens, you will need to query your gst layer for the specific token, which you can do as follows.

init the query, where gst_embedding_dim is the embedding dim of your gst layer (e.g. 256 or 512):

query = torch.zeros(1, 1, self.gst_embedding_dim//2).to(device)

get your style tokens:

_GST = torch.tanh(self.gst_layer.style_token_layer.style_tokens)

get the key for a specific token, where TOKEN is the int index of that style token:

key = _GST[TOKEN].unsqueeze(0).expand(1, -1, -1)

now you can query your gst attention layer for the specified key:

gst_outputs_att = self.gst_layer.style_token_layer.attention(query, key)

this gst_outputs_att [tensor] can now be modified by multiplying:

gst_outputs = gst_outputs_att * 0.15 #example

from here you compute it just as if you were to condition on a wav file.
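
To make the steps above easier to try outside the model, here is a minimal, self-contained sketch. Everything not in the snippets above is an assumption for illustration: ToyStyleAttention is a hypothetical stand-in, not the mozilla/TTS attention class, and the sizes (256-dim embedding, 4 heads, 10 tokens) and the helper name token_style_embedding are made up. It only shows the shapes and the scaling step; it does not show that scaling actually lengthens pauses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyStyleAttention(nn.Module):
    """Hypothetical stand-in for the GST attention layer (NOT the mozilla/TTS class)."""

    def __init__(self, query_dim, key_dim, num_units, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.W_query = nn.Linear(query_dim, num_units, bias=False)
        self.W_key = nn.Linear(key_dim, num_units, bias=False)
        self.W_value = nn.Linear(key_dim, num_units, bias=False)

    def forward(self, query, key):
        # query: [N, T_q, query_dim], key: [N, T_k, key_dim]
        q, k, v = self.W_query(query), self.W_key(key), self.W_value(key)
        n, t_q, units = q.shape
        t_k = k.shape[1]
        head_dim = units // self.num_heads
        # split into heads: [N, heads, T, head_dim]
        q = q.view(n, t_q, self.num_heads, head_dim).transpose(1, 2)
        k = k.view(n, t_k, self.num_heads, head_dim).transpose(1, 2)
        v = v.view(n, t_k, self.num_heads, head_dim).transpose(1, 2)
        # scaled dot-product attention over the keys
        weights = torch.softmax(q @ k.transpose(-2, -1) / self.key_dim ** 0.5, dim=-1)
        out = weights @ v  # [N, heads, T_q, head_dim]
        return out.transpose(1, 2).reshape(n, t_q, units)


# Assumed sizes, for illustration only.
gst_embedding_dim = 256
num_heads = 4
num_style_tokens = 10

# Randomly initialised here; in a real model these are the trained style tokens.
style_tokens = nn.Parameter(torch.randn(num_style_tokens, gst_embedding_dim // num_heads))
attention = ToyStyleAttention(
    query_dim=gst_embedding_dim // 2,
    key_dim=gst_embedding_dim // num_heads,
    num_units=gst_embedding_dim,
    num_heads=num_heads,
)


def token_style_embedding(token, scale):
    """Query the attention layer with a single style token and scale the result."""
    with torch.no_grad():
        query = torch.zeros(1, 1, gst_embedding_dim // 2)
        gst = torch.tanh(style_tokens)
        key = gst[token].unsqueeze(0).expand(1, -1, -1)  # [1, 1, key_dim]
        gst_outputs_att = attention(query, key)          # [1, 1, gst_embedding_dim]
        return gst_outputs_att * scale


gst_outputs = token_style_embedding(token=3, scale=0.15)
print(gst_outputs.shape)
```

In this sketch there is only one key, so the softmax weight on it is always 1 and the zero query has no influence on the result: the output is just the value projection of the chosen token, multiplied by the scale. Doubling the scale therefore exactly doubles the embedding.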

Hope this helps.

For an example, check pull request #451 in mozilla/TTS on GitHub, "Support for Mutlispeaker Tacotron2 with GST and the ability to condition on tokens" by lexkoro.
