Why not convolution layers instead of hidden layers?

Going through the DeepSpeech2 paper:
“Bidirectional RNN models are challenging to deploy in an online, low-latency setting, because they are built to operate on an entire sample, and so it is not possible to perform the transcription process as the utterance streams from the user. We have found an unidirectional architecture that performs as well as our bidirectional models. This allows us to use unidirectional, forward-only RNN layers in our deployment system”
I understand why the Mozilla DeepSpeech architecture was changed from bidirectional to unidirectional.

Still, I am wondering about two things.

  1. Why were convolution layers not tried in place of the first 3 hidden layers, like in Baidu's DeepSpeech2 architecture? (Rough sketch of what I mean below.)
  2. Is the SortaGrad technique applied while training the ASR model? I do not see the training data being sorted in any of the example data importers. Do we have to perform the sorting ourselves before feeding the data to training? (See the second sketch below for the ordering I have in mind.)
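
To make question 1 concrete, here is a rough sketch (not DeepSpeech code) of the kind of DeepSpeech2-style 2D convolution front-end over the time x frequency spectrogram that I mean, written with tf.keras. The filter sizes loosely follow the DS2 paper; N_FREQ, the strides and everything else are my own illustrative assumptions.

```python
import tensorflow as tf

N_FREQ = 161  # assumed number of spectrogram frequency bins, illustration only

# Input is a spectrogram sequence: (time, frequency, channels)
inputs = tf.keras.Input(shape=(None, N_FREQ, 1))
x = tf.keras.layers.Conv2D(32, kernel_size=(11, 41), strides=(2, 2),
                           padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(32, kernel_size=(11, 21), strides=(1, 2),
                           padding="same", activation="relu")(x)
# Collapse the frequency and channel axes so the resulting sequence
# could feed the recurrent layers that follow in the DS2 architecture.
x = tf.keras.layers.Lambda(
    lambda t: tf.reshape(t, [tf.shape(t)[0], tf.shape(t)[1],
                             t.shape[2] * t.shape[3]]))(x)
frontend = tf.keras.Model(inputs, x)
frontend.summary()
```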
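
And for question 2, this is the SortaGrad-style ordering I have in mind: shortest utterances first in the first epoch, ordinary shuffling afterwards. A minimal sketch only; I am assuming the CSV columns the importers write out (wav_filename, wav_filesize, transcript), using wav_filesize as a proxy for utterance length, and "train.csv" is just a placeholder path.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # placeholder path to an importer-generated CSV

def epoch_order(df, epoch):
    """Return the row order for a given epoch (0-based)."""
    if epoch == 0:
        # SortaGrad: first epoch goes from shortest to longest utterance
        return df.sort_values("wav_filesize")
    # Later epochs: plain random shuffle
    return df.sample(frac=1.0, random_state=epoch)

first_epoch = epoch_order(df, 0)
print(first_epoch.head())
```

Is something like this expected from the user before training, or is it handled internally?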

Kindly clarify if I am not making sense.