As of now, I believe the following is the DeepSpeech model architecture:
26 MFCC features → Dense Layer (2048 units, default) → Dropout → Dense Layer (2048 units, default) → Dropout → Dense Layer (2048 units, default) → Dropout → LSTM (2048 units, default) → Dense Layer (2048 units, default) → Output
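In code terms, this is roughly the stack I have in mind (just a Keras-style sketch of my understanding; the activations, dropout rate, and the 29-character output size are my assumptions, not the actual DeepSpeech code):

```python
# My rough mental model of the stack above, written as a Keras sketch.
# Activations, dropout rate, and output size are assumptions for illustration.
import tensorflow as tf

n_features = 26      # MFCC features per frame, as listed above
n_hidden = 2048      # the "default" width quoted above
n_characters = 29    # a..z, space, apostrophe, CTC blank -- assumed

inputs = tf.keras.Input(shape=(None, n_features))            # (time, features)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(inputs)
x = tf.keras.layers.Dropout(0.05)(x)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
x = tf.keras.layers.Dropout(0.05)(x)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
x = tf.keras.layers.Dropout(0.05)(x)
x = tf.keras.layers.LSTM(n_hidden, return_sequences=True)(x)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
outputs = tf.keras.layers.Dense(n_characters, activation="softmax")(x)  # per-frame character probabilities

model = tf.keras.Model(inputs, outputs)
model.summary()
```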
Now, could you please help me find out how the decoder and the LM use this output?
Is the decoder using the output of this model as its input? If so, why can't we also get the output without the decoder? That would help compare the accuracy with vs. without the LM.
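To make clear what I mean by "output without the decoder": I imagine something like a plain greedy (best-path) decode over the per-frame character probabilities, for example (the alphabet and blank index here are just assumptions for illustration, not taken from the DeepSpeech source):

```python
# Greedy decode of the acoustic model's per-frame probabilities, no LM:
# take the most likely symbol per frame, collapse repeats, drop blanks.
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz' ")   # assumed character set
BLANK = len(ALPHABET)                             # assumed: blank is the last index

def greedy_decode(frame_probs: np.ndarray) -> str:
    """frame_probs: (time, n_characters) softmax output of the acoustic model."""
    best = frame_probs.argmax(axis=1)             # most likely symbol per frame
    chars = []
    prev = None
    for idx in best:
        if idx != prev and idx != BLANK:          # collapse repeats, drop blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)

# Tiny fake example: 5 frames over 29 symbols (28 characters + blank)
probs = np.full((5, len(ALPHABET) + 1), 0.01)
for t, idx in enumerate([7, 7, BLANK, 8, BLANK]):  # "h", "h", blank, "i", blank
    probs[t, idx] = 0.9
print(greedy_decode(probs))                        # -> "hi"
```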
And finally, can someone explain, or point me towards, the exact decoder code written for the KenLM vocabulary model?