As of now, I believe the following is the DeepSpeech model architecture:
26 MFCC features → Dense Layer (2048 units, default) → Dropout → Dense Layer (2048 units, default) → Dropout → Dense Layer (2048 units, default) → Dropout → LSTM (2048 units, default) → Dense Layer (2048 units, default) → Output
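In code terms, this is roughly the stack I have in mind (just a Keras-style sketch of my understanding; the activations, dropout rate, and the 29-character output size are my assumptions, not the actual DeepSpeech code):

```python
# My rough mental model of the stack above, written as a Keras sketch.
# Activations, dropout rate, and output size are assumptions for illustration.
import tensorflow as tf

n_features = 26      # MFCC features per frame, as listed above
n_hidden = 2048      # the "default" width quoted above
n_characters = 29    # a..z, space, apostrophe, CTC blank -- assumed

inputs = tf.keras.Input(shape=(None, n_features))            # (time, features)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(inputs)
x = tf.keras.layers.Dropout(0.05)(x)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
x = tf.keras.layers.Dropout(0.05)(x)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
x = tf.keras.layers.Dropout(0.05)(x)
x = tf.keras.layers.LSTM(n_hidden, return_sequences=True)(x)
x = tf.keras.layers.Dense(n_hidden, activation="relu")(x)
outputs = tf.keras.layers.Dense(n_characters, activation="softmax")(x)  # per-frame character probabilities

model = tf.keras.Model(inputs, outputs)
model.summary()
```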
Now, could you please help me find out how the decoder and the LM use this output?
Is the decoder using the output of this model as its input? If so, why can't we also get the output without the decoder? That would help compare the accuracy with vs. without the LM.
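To make clear what I mean by "output without the decoder": I imagine something like a plain greedy (best-path) decode over the per-frame character probabilities, for example (the alphabet and blank index here are just assumptions for illustration, not taken from the DeepSpeech source):

```python
# Greedy decode of the acoustic model's per-frame probabilities, no LM:
# take the most likely symbol per frame, collapse repeats, drop blanks.
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz' ")   # assumed character set
BLANK = len(ALPHABET)                             # assumed: blank is the last index

def greedy_decode(frame_probs: np.ndarray) -> str:
    """frame_probs: (time, n_characters) softmax output of the acoustic model."""
    best = frame_probs.argmax(axis=1)             # most likely symbol per frame
    chars = []
    prev = None
    for idx in best:
        if idx != prev and idx != BLANK:          # collapse repeats, drop blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)

# Tiny fake example: 5 frames over 29 symbols (28 characters + blank)
probs = np.full((5, len(ALPHABET) + 1), 0.01)
for t, idx in enumerate([7, 7, BLANK, 8, BLANK]):  # "h", "h", blank, "i", blank
    probs[t, idx] = 0.9
print(greedy_decode(probs))                        # -> "hi"
```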
And finally, can someone explain, or point me towards, the exact decoder code written for the KenLM vocabulary model?