Understanding the input and output of the network?

I’m trying to understand the DeepSpeech model a bit better. My understanding is the input is MFCC (I don’t fully understand what that means) and the output is a prediction of a phoneme, which is then corrected via a CTC loss? I assume the language modeling is done in “post processing.” Do I have the fundamentals correct?

This is a very rough/simplified description, but this is how I understand it after having read the different papers:

  1. The raw audio wave is converted to MFCC features (a spectrogram-like representation), which contain information about the frequency content of the signal over time
  2. Iterate over the data in 20 ms steps
  3. For each 20 ms frame, which contains information about which frequencies are present in that frame, output a row of probabilities, one for each letter of the alphabet, according to the RNN’s prediction. (MFCC is needed because it gives the neural network good absolute values (features) to use as input.) See the rough sketch after this list.
  4. CTC looks at all of the frames (all of the matrices) and calculates the most probable resulting sequence of characters after applying the loss function. (So not simply the “best path”; rather, it looks at all paths and calculates the most probable result, since multiple “paths” can result in the same transcription.)
  5. Output this transcription
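
To check my own understanding of steps 1–3, here is a rough Python sketch of the shapes involved. The filename, window/step sizes, feature count, and the use of python_speech_features are my own assumptions for illustration, not necessarily what DeepSpeech actually does:

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc   # any MFCC implementation would do

# Step 1: load the raw audio and compute MFCC features over time.
# The window/step sizes and number of coefficients here are illustrative only.
sample_rate, signal = wav.read("recording.wav")            # hypothetical 16 kHz mono file
features = mfcc(signal, samplerate=sample_rate,
                winlen=0.025, winstep=0.020, numcep=26)    # shape: (num_frames, 26)

# Steps 2-3: conceptually, the RNN consumes one feature row per ~20 ms step and
# emits one probability distribution over the output alphabet per step.
num_frames = features.shape[0]
alphabet_size = 28                                         # number of symbols the net can emit

# Placeholder for the RNN: random numbers, just to show the shapes involved.
logits = np.random.randn(num_frames, alphabet_size)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax per frame
print(probs.shape)   # (num_frames, 28): one row of per-symbol probabilities per frame
```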

Now, where exactly the language model (LM) comes into play I’m not 100% sure, but I assume it must happen “together” with CTC, since then the LM can help decide which paths are more likely than others according to its own knowledge.

Please correct me if I’m wrong. @reuben

Also note that the DeepSpeech implementation by Mozilla differs in some ways from the architecture introduced by Baidu in the paper above (like the use of an LSTM for the recurrent layer).

So what’s the actual loss? Is it CTC or the probabilities output by the RNN for the various letters?

Your explanation is correct, I just have some minor clarifications.

The probabilities are actually transition probabilities rather than character probabilities. The “blank” symbol means no transition.

CTC loss is not involved in the decoding process. The decoder is a prefix beam search process where several beams (paths through the transition probability matrix) are explored, and the most likely one is returned.

Exactly. The language model is used to re-score the beams during decoding.
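
Roughly, and simplifying a lot (this is not the actual decoder code): each beam is a partial transcription whose acoustic score is accumulated from the transition probabilities, and the language model contributes a weighted score on top. Something like:

```python
import math

def acoustic_log_prob(probs, path):
    """Log-probability of one path through the (num_frames x alphabet_size)
    transition probability matrix: one symbol index per frame, summed in log space.
    The real prefix beam search also merges all paths that collapse to the same text."""
    return sum(math.log(probs[t][s]) for t, s in enumerate(path))

def beam_score(acoustic, lm_log_prob, word_count, alpha=0.75, beta=1.85):
    """Re-scored beam: acoustic score plus a weighted language model score plus a
    word-count bonus. The alpha/beta values here are just illustrative placeholders."""
    return acoustic + alpha * lm_log_prob + beta * word_count
```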

Thanks for clarifying, I mistook the term “CTC loss function” for the process of the decoder where it reduces a sequence like “HHH_EL_LL_O” to “Hello”. Thought that’s what “applying the loss function” meant. I have to look into that again I guess.

Thank you kindly to @reuben and @beiserjohannes for clarifying.

So the output of the network, before the decoder is what… a phoneme? Or a word?

For, say, a 200 ms recording, the RNN’s output before the decoder is 10 (10 × 20 ms) probability vectors (together forming the transition matrix), each of which contains a probability for each element of the alphabet. The decoder then does its thing to determine the best transcription.

Edit: phonemes are not relevant, since this is an end-to-end (e2e) acoustic model that directly predicts letters instead of producing a phoneme prediction like a traditional phoneme-based model would. DeepSpeech is a grapheme-based e2e model.
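
Purely to illustrate the shapes for that 200 ms example (the numbers are just the ones from this thread):

```python
import numpy as np

step_ms = 20                        # one RNN output per 20 ms of audio
audio_ms = 200                      # length of the example recording
num_steps = audio_ms // step_ms     # -> 10 time steps

alphabet_size = 28                  # one column per symbol the network can emit
outputs = np.zeros((num_steps, alphabet_size))
print(outputs.shape)                # (10, 28): ten rows of per-symbol probabilities
```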

Got it - so the RNN outputs a probability transition matrix for each letter (a-z and space)?

Something like:

[image: example table of per-letter probabilities]

For each 20ms of audio?

exactly! for each 20ms

Edit:

a–z + space + blank (the blank is needed by CTC)
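
Concretely, the output alphabet looks something like this (the underscore for the blank is just for readability; internally it’s simply one more index):

```python
# a-z + space + the CTC blank -> 28 output symbols per time step
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" "]
BLANK = "_"                   # never appears in the final transcription
LABELS = ALPHABET + [BLANK]
print(len(LABELS))            # 28
```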


Right, so if I get aaaaaaa blank aaaaaa that turns into aa with CTC, right?

yes, see it in action here
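
The collapse rule is: merge repeated symbols first, then drop the blanks. A tiny sketch (with "_" standing in for the blank):

```python
def ctc_collapse(symbols, blank="_"):
    """Collapse a per-frame symbol sequence the way CTC defines it:
    merge consecutive repeats, then remove the blank symbol."""
    collapsed = []
    prev = None
    for s in symbols:
        if s != prev:                 # merge runs of the same symbol
            collapsed.append(s)
        prev = s
    return "".join(s for s in collapsed if s != blank)

print(ctc_collapse("aaaaaaa_aaaaaa"))   # -> "aa"
print(ctc_collapse("HHH_EL_LL_O"))      # -> "HELLO"
```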

Thank you so much. I’ve been wrestling with understanding this for a while (even after reading various papers).
