I noticed that the GST layer (or reference encoder) does not take sequence length into account.
Assume the input spectrogram has shape (B, T, C). After several convolutions and a GRU, the final state of the GRU is returned. However, the spectrograms in a batch have different valid lengths (shorter ones are padded), so for the shorter ones the GRU also runs over padding frames before its final state is taken. Should we track the valid lengths and take the GRU state at the last valid step of each spectrogram instead?
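To make the question concrete, here is a minimal framework-free sketch of what I have in mind. It assumes each convolution uses kernel size 3, stride 2 and padding 1 (I have not checked these against the actual layer), and the helper names `conv_output_length`, `reduced_lengths` and `gather_last_valid` are hypothetical, not from the repository. The idea is to propagate the valid lengths through the convolutions and then pick the GRU output at the last valid step of each sequence rather than at the last padded step.

```python
from typing import List


def conv_output_length(length: int, kernel_size: int = 3,
                       stride: int = 2, padding: int = 1) -> int:
    """Length of one axis after a single convolution (standard formula).

    The default kernel/stride/padding values are assumptions for illustration.
    """
    return (length + 2 * padding - kernel_size) // stride + 1


def reduced_lengths(lengths: List[int], num_conv_layers: int) -> List[int]:
    """Propagate each valid length through the stack of convolutions."""
    out = list(lengths)
    for _ in range(num_conv_layers):
        out = [conv_output_length(n) for n in out]
    return out


def gather_last_valid(outputs: List[List[List[float]]],
                      lengths: List[int]) -> List[List[float]]:
    """Pick the state at step (length - 1) for every sequence in the batch.

    `outputs` is indexed as [batch][time][hidden], i.e. the per-step GRU
    outputs. Lengths are clamped to at least 1 to avoid indexing step -1.
    """
    return [seq[max(n, 1) - 1] for seq, n in zip(outputs, lengths)]


# Valid frame counts of two spectrograms in one padded batch.
valid = reduced_lengths([100, 37], num_conv_layers=2)

# Toy per-step GRU outputs with hidden size 1; the second sequence has
# only two valid steps, so its third step was computed on padding.
toy_outputs = [[[0.0], [1.0], [2.0]],
               [[10.0], [11.0], [12.0]]]
last_states = gather_last_valid(toy_outputs, [3, 2])
```

With these assumed settings, `valid` is `[25, 10]` and `last_states` is `[[2.0], [11.0]]`, so the second sequence skips the state computed on padding. If the model is written in PyTorch (the post does not depend on this), passing the reduced lengths to `torch.nn.utils.rnn.pack_padded_sequence` before the GRU should have the same effect, since the returned final state then corresponds to the last valid step of each sequence.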