Graves Attention

@erogol I looked into the implementation of Graves attention from the Battenberg paper and I think it’s wrong in the dev branch. It uses softplus for the mean term (as in the V2 model from the paper) but exponential for the variance (as in the V1 model). When I train with the current dev branch I get major attention artifacts during inference - it basically doesn’t work.
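For reference, here is a minimal sketch of the two parameterizations as I read the paper. The tensor names (`delta_hat`, `sigma_hat`) are placeholders I made up, not identifiers from the repo:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw network outputs for one decoder step:
# delta_hat -> mean offset, sigma_hat -> scale, both shape [B, K]
delta_hat = torch.randn(4, 5)
sigma_hat = torch.randn(4, 5)

# V1-style: exponential for both terms
delta_v1 = torch.exp(delta_hat)
sigma_v1 = torch.exp(sigma_hat)

# V2-style: softplus for both terms
delta_v2 = F.softplus(delta_hat)
sigma_v2 = F.softplus(sigma_hat)

# The current dev branch mixes the two: softplus for the mean offset
# but exponential for the variance.
```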

The implementation of the V2b model from the paper that was in the dev branch before Nov 8 is also incorrect. It is multiplying by the variance instead of dividing in the exponential term (`phi_t = g_t * torch.exp(-0.5 * sig_t * (mu_t_ - j)**2)`, when it should be `phi_t = g_t * torch.exp(-0.5 * (mu_t_ - j)**2 / sig_t)`).
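To make the fix concrete, here is a minimal sketch of the corrected weight computation; the shapes, the dummy inputs, and names like `mu_t`, `sig_t`, `g_t` are illustrative assumptions, not code from the repo:

```python
import torch
import torch.nn.functional as F

B, K, T = 4, 5, 50                                       # batch, mixture components, memory length
g_t = torch.softmax(torch.randn(B, K), dim=-1).unsqueeze(-1)  # mixture weights  [B, K, 1]
mu_t = torch.rand(B, K, 1) * T                                # component means  [B, K, 1]
sig_t = F.softplus(torch.randn(B, K, 1))                      # component scales [B, K, 1]

j = torch.arange(T).float().view(1, 1, T)                     # memory positions [1, 1, T]

# Corrected: divide by the variance term inside the exponential
phi_t = g_t * torch.exp(-0.5 * (mu_t - j) ** 2 / sig_t)       # [B, K, T]
alpha_t = phi_t.sum(dim=1)                                    # attention weights over memory [B, T]
```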

With the correct implementation of the V2b model, the attention seems to work really well. It converges quickly. More importantly, it doesn’t stutter or repeat on long, difficult sentences.

I saw somewhere that you weren’t happy with GMM attention performance. Were you using the correct implementation during your evaluation?


Wrong is a strong statement, but yes, it is different. It did not work as described in the paper, so I changed it a bit. Their version was aggregating too much value at the initial steps and was not diffusing it, so I changed it after doing some empirical checks. The last time I checked, my version worked quite well for LJSpeech, but it might be different in your case. Please send a PR and let me check.
