Does the GST layer consider sequence length?

iclementine · April 8, 2021, 10:45am

I noticed that the GST layer (more precisely, its reference encoder) does not take sequence length into account.

mozilla/TTS/blob/e9e07844b77a43fb0864354791fb4cf72ffded11/TTS/tts/layers/gst_layers.py#L27


    def forward(self, inputs, speaker_embedding=None):
        enc_out = self.encoder(inputs)
        # concat speaker_embedding
        if speaker_embedding is not None:
            enc_out = torch.cat([enc_out, speaker_embedding], dim=-1)
        style_embed = self.style_token_layer(enc_out)

        return style_embed


class ReferenceEncoder(nn.Module):
    """NN module creating a fixed size prosody embedding from a spectrogram.

    inputs: mel spectrograms [batch_size, num_spec_frames, num_mel]
    outputs: [batch_size, embedding_dim]
    """

    def __init__(self, num_mel, embedding_dim):

        super().__init__()
        self.num_mel = num_mel

Assume the input spectrogram has shape (B, T, C). After several convolutions and a GRU, the final state of the GRU is returned. But the spectrograms in a batch have different valid lengths (shorter ones are padded), so for a padded spectrogram the final state is computed after the GRU has also run over the padding frames. Should we take the valid lengths into account and return the GRU state at the last valid frame of each spectrogram instead?
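To make the question concrete, here is a minimal sketch of what length-aware handling could look like. It is not the actual TTS code: the helper names `conv_out_lengths` and `last_valid_state` are hypothetical, and the convolution settings (kernel 3, stride 2, padding 1) are assumptions that would need to match the real reference encoder. The idea is to shrink the valid lengths the same way the strided convolutions shrink the time axis, then pack the sequence so the GRU stops at each item's last valid step.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence


def conv_out_lengths(lengths, num_layers, kernel_size=3, stride=2, padding=1):
    """Valid frame counts left after `num_layers` strided convolutions.

    Kernel size, stride and padding are assumed values; they must match
    the convolution stack that is actually used.
    """
    for _ in range(num_layers):
        lengths = torch.div(
            lengths + 2 * padding - kernel_size, stride, rounding_mode="floor"
        ) + 1
    return lengths


def last_valid_state(gru, features, lengths):
    """Run `gru` over padded `features` [B, T, C] and return the hidden
    state at each sequence's last valid step, shape [B, hidden_size]."""
    packed = pack_padded_sequence(
        features, lengths.cpu(), batch_first=True, enforce_sorted=False
    )
    _, hidden = gru(packed)  # hidden: [num_layers, B, hidden_size]
    return hidden[-1]


if __name__ == "__main__":
    gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
    features = torch.randn(2, 5, 4)   # second item has 2 padded steps
    lengths = torch.tensor([5, 3])
    print(last_valid_state(gru, features, lengths).shape)
```

This only changes where the GRU state is read. The padded frames still pass through the convolutions (and any normalization in them), so it may not remove the effect of padding entirely.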

Thank you.