Why not convolution layers instead of hidden layers?

Going through the DeepSpeech2 paper:
“Bidirectional RNN models are challenging to deploy in an online, low-latency setting, because they are built to operate on an entire sample, and so it is not possible to perform the transcription process as the utterance streams from the user. We have found an unidirectional architecture that performs as well as our bidirectional models. This allows us to use unidirectional, forward-only RNN layers in our deployment system”
I understand why the Mozilla DeepSpeech architecture was changed from bidirectional to unidirectional.

Still, I am wondering about two things.

  1. Why were convolution layers not tried in place of the first 3 hidden layers, like in Baidu's DeepSpeech2 architecture? (Rough sketch of what I mean below.)
  2. Is the SortaGrad technique applied while training the ASR model? I do not see the training data being sorted in any of the example data importers. Do we have to perform the sorting ourselves before feeding the data to training? (See the second sketch below for the ordering I have in mind.)
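
To make question 1 concrete, here is a rough sketch (not DeepSpeech code) of the kind of DeepSpeech2-style 2D convolution front-end over the time x frequency spectrogram that I mean, written with tf.keras. The filter sizes loosely follow the DS2 paper; N_FREQ, the strides and everything else are my own illustrative assumptions.

```python
import tensorflow as tf

N_FREQ = 161  # assumed number of spectrogram frequency bins, illustration only

# Input is a spectrogram sequence: (time, frequency, channels)
inputs = tf.keras.Input(shape=(None, N_FREQ, 1))
x = tf.keras.layers.Conv2D(32, kernel_size=(11, 41), strides=(2, 2),
                           padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(32, kernel_size=(11, 21), strides=(1, 2),
                           padding="same", activation="relu")(x)
# Collapse the frequency and channel axes so the resulting sequence
# could feed the recurrent layers that follow in the DS2 architecture.
x = tf.keras.layers.Lambda(
    lambda t: tf.reshape(t, [tf.shape(t)[0], tf.shape(t)[1],
                             t.shape[2] * t.shape[3]]))(x)
frontend = tf.keras.Model(inputs, x)
frontend.summary()
```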
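
And for question 2, this is the SortaGrad-style ordering I have in mind: shortest utterances first in the first epoch, ordinary shuffling afterwards. A minimal sketch only; I am assuming the CSV columns the importers write out (wav_filename, wav_filesize, transcript), using wav_filesize as a proxy for utterance length, and "train.csv" is just a placeholder path.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # placeholder path to an importer-generated CSV

def epoch_order(df, epoch):
    """Return the row order for a given epoch (0-based)."""
    if epoch == 0:
        # SortaGrad: first epoch goes from shortest to longest utterance
        return df.sort_values("wav_filesize")
    # Later epochs: plain random shuffle
    return df.sample(frac=1.0, random_state=epoch)

first_epoch = epoch_order(df, 0)
print(first_epoch.head())
```

Is something like this expected from the user before training, or is it handled internally?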

Kindly clarify if I am not making sense.