Current DeepSpeech architecture

Hello Team, I was looking at last year's blog post https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/ from @reuben regarding the change in Mozilla's architecture. I am writing this query to confirm whether Mozilla still uses the same architecture. When I looked at the code, I realized that DeepSpeech now uses 6 layers. Could you please confirm whether my understanding of the current architecture is correct:

3 fully connected layers (dense) -> uni-directional RNN layer -> fully connected layer (dense) -> output layer (fully-connected)

The hidden fully connected layers use the ReLU activation. The RNN layer uses the tanh activation.
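To make sure I am reading it right, here is a minimal Keras-style sketch of that stack (purely illustrative: the real model is built with lower-level TF graph code, and the widths and output size below are placeholders, not the actual hyperparameters):

```python
import tensorflow as tf

n_hidden = 2048   # placeholder width, not the real flag value
n_classes = 29    # placeholder: alphabet size + CTC blank

model = tf.keras.Sequential([
    # Layers 1-3: fully connected, ReLU (the real model also clips the ReLU and uses dropout)
    tf.keras.layers.Dense(n_hidden, activation="relu"),
    tf.keras.layers.Dense(n_hidden, activation="relu"),
    tf.keras.layers.Dense(n_hidden, activation="relu"),
    # Layer 4: uni-directional recurrent layer (LSTM, tanh activation)
    tf.keras.layers.LSTM(n_hidden, return_sequences=True),
    # Layer 5: fully connected, ReLU
    tf.keras.layers.Dense(n_hidden, activation="relu"),
    # Layer 6: output layer producing per-timestep character logits for CTC
    tf.keras.layers.Dense(n_classes),
])
```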

@lissyx @reuben: Please throw some light.

Do you mind avoiding pinging people when you don’t get an answer within a few hours? We are not all in the same time zone.

That aside: yes, we have not made any major changes to the core of the network since that blog post.

hey @lissyx,

Is the architecture still the same?

3 fully connected layers (dense) -> uni-directional RNN layer -> fully connected layer (dense) -> output layer (fully-connected)

Compared to Baidu’s architecture, which is:
3 fully connected layers (dense) -> bi-directional RNN layer -> output fully connected layer (dense)

I have trouble understanding the code. I have a few questions for you:

  • What are the 5th and 6th (dense) layers for?
  • Why did you add another dense layer compared to Baidu’s? (related to the first question)
  • Is the RNN just a plain RNN or an LSTM?
  • Why do you use different RNNs between training and inference?

Thank you for your time,
PS: if you have any link besides the one above, I’ll take it :slight_smile:

Mostly nothing has changed since June …

Have you read the blog post linked earlier? It does explain all the changes we made …

Which RNN are you referring to? We introduced CuDNN support recently; that requires CUDA to run, so we need to change it for other runtimes. Similarly, the BlockFusedLSTM cell does not work on TFLite, so we change it as well at export time.
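Roughly, the selection looks like this (the flag names and the helper stubs below are illustrative only, modeled on the rnn_impl_* helpers mentioned in this thread, not the exact source):

```python
# Illustrative only: flag names and the rnn_impl_* stubs are assumptions,
# not the actual DeepSpeech code.

def rnn_impl_cudnn_rnn(x):           # placeholder for the CuDNN-backed LSTM
    ...

def rnn_impl_lstmblockfusedcell(x):  # placeholder for the fused LSTM block cell
    ...

def rnn_impl_static_rnn(x):          # placeholder for a plain LSTM + static RNN loop
    ...

def select_rnn_impl(use_cudnn_rnn, export_tflite):
    if use_cudnn_rnn:
        # CuDNN LSTM: best training throughput, but requires a CUDA runtime.
        return rnn_impl_cudnn_rnn
    if export_tflite:
        # TFLite cannot run the fused LSTM block op, so the exported graph
        # falls back to a plain LSTM cell driven by a static RNN.
        return rnn_impl_static_rnn
    # Default export/inference path: fused LSTM block cell.
    return rnn_impl_lstmblockfusedcell
```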

No extra layers were added compared to Baidu; we just swapped the bidirectional RNN layer for a unidirectional one. We also use LSTM instead of GRU.
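In Keras-style pseudocode the swap amounts to something like this (illustrative only; the real graph is built with lower-level TF ops and the width is a placeholder):

```python
import tensorflow as tf

n_hidden = 2048  # placeholder width

# Baidu-style recurrent layer: bidirectional, so it needs the whole utterance up front.
baidu_style = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(n_hidden, return_sequences=True))

# Current DeepSpeech recurrent layer: uni-directional LSTM, which only depends on
# past frames and is what makes streaming inference possible.
deepspeech_style = tf.keras.layers.LSTM(n_hidden, return_sequences=True)
```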

We use different training and inference graphs because training is optimized for maximum throughput on training machines with GPUs, whereas the inference graph is targeted towards on-device inference with low latency.

@reuben @lissyx

thanks for your quick responses, it is clearer now.
Even after reading the blog I had some doubts, so I asked…

I’ll ask again if I have other questions, but I hope not!
Have a good day or evening :slight_smile:

Can you point out what was unclear in the blog post, now that your doubts are fixed? :slight_smile:

Well, I don’t know if it comes from the blog post or from me, but I understood the choice of a uni-directional RNN as something you only make when you want to do streaming.
So I thought there were 2 versions of DeepSpeech, one with a bi-directional RNN (let’s say the standard version) and one with a uni-directional RNN (the streaming version). But no, it was actually describing the modification made to the architecture itself. That was my doubt.

And when I looked into the code (function rnn_impl_lstmblockfusedcell), I couldn’t tell whether it was bi- or uni-directional (I don’t know if that is due to my lack of knowledge or to a missing comment in that function).
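For reference (in case it helps anyone else), a minimal TF 1.x sketch of driving such a fused cell, with made-up shapes, looks like this; since the cell only makes a single pass over the time dimension, I now read it as uni-directional:

```python
import tensorflow as tf  # TF 1.x, where tf.contrib.rnn.LSTMBlockFusedCell lives

# Made-up shapes: time-major input [time, batch, features], as the fused cell expects.
inputs = tf.placeholder(tf.float32, [16, 1, 2048])

cell = tf.contrib.rnn.LSTMBlockFusedCell(2048)

# One forward pass over the time dimension, front to back: uni-directional.
outputs, final_state = cell(inputs, dtype=tf.float32)
```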

My other questions were more due to my lack of knowledge of DeepSpeech. I hadn’t understood that Mozilla’s DeepSpeech has 5 layers + an output layer, just like Baidu’s. What I had understood was that there were 5 layers + an output layer in Baidu’s DS and 6 layers + an output layer in Mozilla’s.