Hi everyone.
I’m trying to implement my own version of Tacotron based on the paper, which says: “The convolution outputs are stacked together and further max pooled along time to increase local invariances. Note that we use a stride of 1 to preserve the original time resolution.” I don’t understand this: isn’t max pooling along time with a stride of 1 equivalent to doing nothing?
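A quick way to check this is a minimal sketch in plain Python. Stride and pooling width are separate settings: the stride is how far the window moves each step, while the width is how many time steps each window covers. The helper `max_pool_1d`, the width of 2, the right-padding with negative infinity, and the toy signal are all assumptions for illustration, not taken from the quoted passage.

```python
def max_pool_1d(x, width=2, stride=1):
    """Max-pool a 1-D sequence along time.

    Right-pads with -inf (an illustrative choice) so that a stride of 1
    produces exactly one output per input time step.
    """
    padded = list(x) + [float("-inf")] * (width - 1)
    # Slide a window of `width` steps, moving `stride` steps each time.
    return [max(padded[i:i + width]) for i in range(0, len(x), stride)]


signal = [1, 3, 2, 5, 4]

# Stride 1, width 2: same length as the input, but the values change.
pooled = max_pool_1d(signal, width=2, stride=1)
print(pooled)  # [3, 3, 5, 5, 4]

# Stride 1, width 1: this is the case that really is a no-op.
print(max_pool_1d(signal, width=1, stride=1))  # [1, 3, 2, 5, 4]

# Stride 2, width 2: the usual downsampling pool, which shortens the sequence.
print(max_pool_1d(signal, width=2, stride=2))  # [3, 5, 4]
```

So in this sketch, a stride of 1 only does nothing when the window is also 1 wide. With a wider window, each output is the maximum over a small neighbourhood, and the number of time steps stays the same.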