Differences from the original papers?

Hi,

How does the currently implemented model differ from the ones described in the two Baidu papers? Is the rationale for the differences (such as using LSTM rather than GRU) documented?
Are changes to the architecture or optimizations planned based on those or other papers?
