The architecture of the engine was originally motivated by the one presented in Deep Speech: Scaling up end-to-end speech recognition, but it now differs from that design in many respects. The core of the engine is a recurrent neural network (RNN) trained to ingest speech spectrograms and generate English text transcriptions.
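For anyone reading along, a rough sketch of that kind of model might look like the following. This is purely illustrative and not the engine's actual code; the layer sizes, alphabet size, and the choice of a bidirectional GRU trained with CTC are my own assumptions.

```python
# Minimal sketch of a Deep Speech-style acoustic model:
# spectrogram frames -> dense layers -> recurrent layer ->
# per-frame character probabilities, trained with a CTC loss.
# All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_features=26, n_hidden=512, n_chars=29):
        super().__init__()
        # Per-frame feed-forward layers over spectrogram features.
        self.dense = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # Recurrent layer modelling temporal context across frames.
        self.rnn = nn.GRU(n_hidden, n_hidden, batch_first=True,
                          bidirectional=True)
        # Project to per-character log-probabilities (incl. CTC blank).
        self.out = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, spectrograms):             # (batch, time, features)
        x = self.dense(spectrograms)
        x, _ = self.rnn(x)
        return self.out(x).log_softmax(dim=-1)   # (batch, time, chars)

# Toy training step: 4 utterances of 100 frames, 26 features each.
model = SpeechRNN()
ctc = nn.CTCLoss(blank=0)
specs = torch.randn(4, 100, 26)
targets = torch.randint(1, 29, (4, 20))          # dummy character indices
log_probs = model(specs).transpose(0, 1)         # CTC expects (time, batch, chars)
input_lens = torch.full((4,), 100, dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
```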
Can anyone provide pointers to the stated differences (blogs, documentation, or publications)?
Also, can we look forward to a compatible RNN-T option for decoding in the near future?
I apologize for any inconvenience. Sure, I will keep that in mind.
I came here from the documentation and wondered whether there is an article or document that discusses these differences explicitly, other than the architecture documentation.
Thank you for the hint.
I generally avoid pinging the core team. Anyway, thanks for mentioning it!