I am training with GST and it works very well. It makes the speech much more expressive and it can learn a few different styles. However, I would like to make the pauses between sentences longer. Even though my dataset has the exact same pause duration at periods, I still haven't figured out how to do it, as mentioned here https://github.com/mozilla/TTS/issues/167. My idea was to multiply the tensors by a value, but nothing really worked. Ideally, the style would stay the same, but the speech and pauses would change. The problem is that I do not really know how to disentangle these.
If you want to condition on tokens, you need to query your GST layer for the specific token, which you can do as follows.
Initialize the query, where gst_embedding_dim is the embedding dimension of your GST layer (e.g. 256 or 512):
query = torch.zeros(1, 1, self.gst_embedding_dim//2).to(device)
Get your style tokens:
_GST = torch.tanh(self.gst_layer.style_token_layer.style_tokens)
Get the key for a specific token, where TOKEN is an integer index:
key = _GST[TOKEN].unsqueeze(0).expand(1, -1, -1)
Now you can query your GST attention layer with that key:
gst_outputs_att = self.gst_layer.style_token_layer.attention(query, key)
The resulting gst_outputs_att tensor can now be scaled by multiplying it by a constant:
gst_outputs = gst_outputs_att * 0.15  # example scale factor
From here, you compute the rest just as if you were conditioning on a wav file.
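To make the steps above easier to try in isolation, here is a minimal self-contained sketch. ToyStyleTokenLayer is a hypothetical stand-in that I wrote for illustration, not the actual Mozilla TTS class. It only assumes that the style tokens have shape [num_tokens, embedding_dim // num_heads] and that the attention takes a query of size embedding_dim // 2, which matches the snippets above. The helper name token_style_embedding is also made up.

```python
import torch
import torch.nn as nn


class ToyStyleTokenLayer(nn.Module):
    """Hypothetical stand-in for a GST style token layer (NOT the Mozilla TTS class)."""

    def __init__(self, num_tokens=10, embedding_dim=256, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        key_dim = embedding_dim // num_heads      # assumed per-token size
        query_dim = embedding_dim // 2            # assumed query size
        self.style_tokens = nn.Parameter(torch.randn(num_tokens, key_dim))
        self.W_query = nn.Linear(query_dim, embedding_dim, bias=False)
        self.W_key = nn.Linear(key_dim, embedding_dim, bias=False)
        self.W_value = nn.Linear(key_dim, embedding_dim, bias=False)

    def attention(self, query, key):
        # query: [B, T_q, query_dim], key: [B, T_k, key_dim]
        q, k, v = self.W_query(query), self.W_key(key), self.W_value(key)
        B, T_q, D = q.shape
        h = self.num_heads

        def split(x):
            # [B, T, D] -> [B, h, T, D // h]
            return x.view(B, -1, h, D // h).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = torch.softmax(q @ k.transpose(-2, -1) / (D // h) ** 0.5, dim=-1)
        out = scores @ v                          # [B, h, T_q, D // h]
        return out.transpose(1, 2).reshape(B, T_q, D)


def token_style_embedding(layer, token, scale, embedding_dim=256):
    """Query the layer for a single style token and scale the result."""
    query = torch.zeros(1, 1, embedding_dim // 2)
    gst = torch.tanh(layer.style_tokens)
    key = gst[token].unsqueeze(0).expand(1, -1, -1)   # [1, 1, key_dim]
    gst_outputs_att = layer.attention(query, key)     # [1, 1, embedding_dim]
    return gst_outputs_att * scale


torch.manual_seed(0)
layer = ToyStyleTokenLayer()
emb = token_style_embedding(layer, token=3, scale=0.15)
emb_double = token_style_embedding(layer, token=3, scale=0.30)
```

In this sketch emb has shape [1, 1, 256], and the scale factor acts linearly, so emb_double is exactly twice emb. In a real model you would call the attention through self.gst_layer.style_token_layer as in the lines above instead of the toy layer.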
Hope this helps.
See this commit for an example: https://github.com/mozilla/TTS/pull/451/commits/c5eaf12d7b9b5b74b99a633a5622a44b123e49e4#