With reference to the following:
The architecture of the engine was originally motivated by that presented in Deep Speech: Scaling up end-to-end speech recognition. However, the engine currently differs in many respects from the engine it was originally motivated by. The core of the engine is a recurrent neural network (RNN) trained to ingest speech spectrograms and generate English text transcriptions.
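For context on what "ingest speech spectrograms and generate English text transcriptions" means in practice, here is a toy sketch of that pipeline: a minimal Elman RNN maps spectrogram frames to per-frame character logits, followed by greedy CTC-style decoding. This is purely illustrative (random weights, invented `TinyRNN` class, `_` assumed as the blank symbol) and is not the actual DeepSpeech implementation:

```python
import math
import random

ALPHABET = "_abcdefghijklmnopqrstuvwxyz '"  # '_' used as the CTC blank (assumption)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class TinyRNN:
    """Toy Elman RNN: spectrogram frames in, per-frame character logits out."""
    def __init__(self, n_features, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.W_xh = mat(n_hidden, n_features)  # input -> hidden
        self.W_hh = mat(n_hidden, n_hidden)    # hidden -> hidden (recurrence)
        self.W_hy = mat(n_out, n_hidden)       # hidden -> character logits
        self.n_hidden = n_hidden

    def forward(self, frames):
        h = [0.0] * self.n_hidden
        logits = []
        for x in frames:
            # h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1})
            h = [math.tanh(sum(w * xi for w, xi in zip(row_x, x)) +
                           sum(w * hi for w, hi in zip(row_h, h)))
                 for row_x, row_h in zip(self.W_xh, self.W_hh)]
            logits.append([sum(w * hi for w, hi in zip(row, h)) for row in self.W_hy])
        return logits

def greedy_ctc_decode(logits):
    """Collapse repeated symbols and drop blanks (standard greedy CTC decoding)."""
    out, prev = [], None
    for frame in logits:
        probs = softmax(frame)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != prev and ALPHABET[best] != "_":
            out.append(ALPHABET[best])
        prev = best
    return "".join(out)

rnn = TinyRNN(n_features=4, n_hidden=8, n_out=len(ALPHABET))
frames = [[0.1 * t, 0.2, -0.1, 0.05 * t] for t in range(10)]  # fake spectrogram frames
transcript = greedy_ctc_decode(rnn.forward(frames))
```

With untrained random weights the "transcript" is of course meaningless; the sketch only shows the data flow (frames → recurrent state → per-frame logits → collapsed character string) that the quoted description refers to.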
Can anyone provide pointers to these differences (blogs, documentation, or publications)?
Also, can we look forward to a (compatible) RNN-T decoding option in the near future?