Isn't an encoder-decoder a more suitable model for a speech-to-text problem?

Before I start, I want to state that I'm not questioning the decisions that were made; rather, I'm trying to learn.

Based on what I've read in the documentation, DeepSpeech uses an LSTM (along with a few fully connected layers). There is only one LSTM (no encoder-decoder), which means the model produces one output per input frame, so the number of elements in the input and the output match. This is in contrast to an encoder-decoder design, where the numbers of input and output elements don't necessarily match.
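
To make sure I've understood the shape of that design, here is a minimal sketch in PyTorch (just my own illustration; the real DeepSpeech is implemented in TensorFlow and differs in detail, and the layer sizes, feature count, and alphabet size below are placeholders). The point is simply that there is one output vector per input frame and no separate decoder network:

```python
import torch
import torch.nn as nn

class FramewiseAcousticModel(nn.Module):
    """Toy DeepSpeech-style model: dense layers, a single LSTM, one output per frame."""

    def __init__(self, n_features=26, n_hidden=512, n_classes=29):
        super().__init__()
        # Fully connected layers applied independently to each spectrogram frame.
        self.dense = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # One recurrent layer -- there is no separate decoder network.
        self.lstm = nn.LSTM(n_hidden, n_hidden, batch_first=True)
        # Per-frame projection onto the output alphabet (characters + CTC blank).
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):          # x: (batch, frames, n_features)
        h = self.dense(x)
        h, _ = self.lstm(h)
        return self.out(h)         # (batch, frames, n_classes): one prediction per input frame
```

So a 100-frame spectrogram always yields 100 framewise predictions, and it's up to the loss/decoding step to collapse them into a shorter transcript.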

In a speech-to-text problem, we have a spectrogram as input and text as output, and of course the number of elements in the input is much higher than in the output. To me, this sounds like a good candidate for an encoder-decoder model, especially since with an encoder-decoder you won't need CTC as the loss function, which makes things easier.
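
For example, here's a rough sketch of the length mismatch a framewise model has to deal with, using PyTorch's `nn.CTCLoss` (the shapes are made up for illustration): 100 spectrogram frames map to a 12-character transcript, and CTC bridges the gap by marginalising over all valid alignments.

```python
import torch
import torch.nn as nn

# Made-up shapes: 100 spectrogram frames in, a 12-character transcript out.
T, N, C = 100, 1, 29                                   # frames, batch size, alphabet size (incl. CTC blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # stand-in for per-frame model outputs
targets = torch.randint(1, C, (N, 12))                 # target character indices (index 0 reserved for blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

# CTC sums over every valid alignment of the 12 characters to the 100 frames,
# which is exactly how the framewise design copes with the length mismatch.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

An encoder-decoder would instead emit one token at a time until it decides to stop, so no such alignment machinery is needed, which is what prompted my question.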

Now my question is: is there any reason why a single RNN is preferred over an encoder-decoder model for a speech-to-text problem?

Hey @mehran, not sure how much context you have coming into this, but I thought I'd link to a resource that's quite a good way to get the lie of the land for the (ever-moving) state of the art in ASR.

Some really interesting papers have been published against the LibriSpeech test-clean benchmark:
https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean, including plenty of sequence-to-sequence models and DeepSpeech itself back in the day (5.39% WER was an impressive result at the time of publication).
