@erogol I looked into the implementation of Graves attention from the Battenberg paper and I think it’s wrong in the dev branch. It uses softplus for the mean term (as in the V2 model from the paper) but an exponential for the variance (as in the V1 model). When I train with the current dev branch I get major attention artifacts during inference - it basically doesn’t work.
The implementation of the V2b model from the paper that was in the dev branch before Nov 8 is also incorrect: it multiplies by the variance instead of dividing by it in the exponential term, i.e. it computes `phi_t = g_t * torch.exp(-0.5 * sig_t * (mu_t_ - j)**2)` when it should be `phi_t = g_t * torch.exp(-0.5 * (mu_t_ - j)**2 / sig_t)`.
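For reference, here is a minimal sketch of the corrected V2b-style score computation I mean. Variable names (`g_t`, `sig_t`, `mu_t_`, `j`) follow the snippet above; the shapes, the `gbk_t` split, and the helper name are my own assumptions for illustration, not the exact code in the repo:

```python
import torch
import torch.nn.functional as F

def gmm_v2b_scores(gbk_t, mu_prev, j):
    """Sketch of corrected V2b-style GMM attention scores.

    gbk_t:   (B, 3K) raw attention-RNN outputs, split into g_hat, b_hat, k_hat
    mu_prev: (B, K)  mixture means from the previous decoder step
    j:       (T,)    memory position indices 0..T-1 (float tensor)
    """
    g_hat, b_hat, k_hat = torch.chunk(gbk_t, 3, dim=-1)   # each (B, K)

    # V2-style parameterization: softplus for both the mean increment
    # and the variance (the current dev branch mixes exp in for the variance).
    g_t = torch.softmax(g_hat, dim=-1)                     # mixture weights
    sig_t = F.softplus(b_hat) + 1e-5                       # variance, kept > 0
    mu_t_ = mu_prev + F.softplus(k_hat)                    # monotonic mean update

    # Divide by the variance in the exponent (the pre-Nov-8 code multiplied).
    phi_t = g_t.unsqueeze(-1) * torch.exp(
        -0.5 * (mu_t_.unsqueeze(-1) - j) ** 2 / sig_t.unsqueeze(-1)
    )                                                      # (B, K, T)
    alpha_t = phi_t.sum(dim=1)                             # (B, T) attention weights
    return alpha_t, mu_t_
```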
With the correct implementation of the V2b model, the attention seems to work really well. It converges quickly, and more importantly it doesn’t stutter or repeat on long, difficult sentences.
I saw somewhere that you weren’t happy with GMM attention performance. Were you using the correct implementation during your evaluation?