Hello Team, I was looking at last year's blog post https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/ from @reuben regarding the change in Mozilla's architecture. I am writing this query to confirm whether Mozilla still uses the same architecture. When I looked at the code, I realized that DeepSpeech now uses 6 layers. Could you please confirm if my understanding of the current architecture is correct:
Compared to Baidu’s architecture, which is: 3 fully connected layers (dense) -> bi-directional RNN layer -> output fully connected layer (dense)
I have trouble understanding the code. I have a few questions for you:
What are the 5th and 6th (dense) layers for?
Why did you add another dense layer compared to Baidu’s? (related to the first question)
Is the RNN just a plain RNN, or an LSTM?
Why do you use a different RNN between training and inference?
Thank you for your time,
PS: if you have any link besides the one above, I’ll take it
lissyx
Mostly nothing has changed since June …
Have you read the blog post linked earlier? It explains all the changes we made …
Which RNN are you referring to? We introduced CuDNN support recently, which requires CUDA to run, so we need to change it for other runtimes. Similarly, LSTMBlockFusedCell does not work on TFLite, so we change it as well at export time.
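To make that swap concrete, here is a toy sketch (function and flag names are hypothetical, not the actual DeepSpeech code) of the idea described above: the RNN implementation is chosen depending on the target runtime.

```python
# Illustrative sketch only -- names are invented for this example.
# CuDNN needs CUDA, so it only fits GPU training; LSTMBlockFusedCell
# does not run on TFLite, so TFLite exports fall back to a plain cell.

def select_rnn_impl(use_cudnn: bool, export_tflite: bool) -> str:
    """Pick an RNN implementation for the graph being built."""
    if export_tflite:
        return "static_rnn_with_lstm_cell"   # TFLite-compatible fallback
    if use_cudnn:
        return "cudnn_rnn"                   # CUDA-only, fast training
    return "lstm_block_fused_cell"           # default fused implementation
```

The weights are the same in every case; only the op used to run the recurrence differs per runtime.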
No extra layers were added compared to Baidu, we just changed the bidirectional RNN layer for a unidirectional layer. We also use LSTM instead of GRU.
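Writing the stack out may help: below is a descriptive sketch of the layer structure as stated in this thread (5 hidden layers + an output layer, same count as Baidu's paper, with the bidirectional RNN swapped for a unidirectional LSTM). The labels are descriptive, not the real variable names in the code.

```python
# Layer stack as described in this thread -- labels are illustrative.
DEEPSPEECH_LAYERS = [
    "dense 1",                   # layers 1-3: fully connected
    "dense 2",
    "dense 3",
    "unidirectional LSTM",       # layer 4: bidirectional RNN in Baidu's paper
    "dense 5",                   # layer 5: fully connected
    "output dense",              # layer 6: output layer
]

# Same total as Baidu's Deep Speech: 5 hidden layers + 1 output layer.
assert len(DEEPSPEECH_LAYERS) == 6
```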
We use different training and inference graphs because training is optimized for maximum throughput on training machines with GPUs, whereas the inference graph is targeted towards on-device inference with low latency.
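A toy illustration (not DeepSpeech code) of why the unidirectional layer suits low-latency, on-device inference: its state at time t depends only on past inputs, so audio can be fed chunk by chunk while carrying the state, and the result is identical to processing the whole sequence at once.

```python
# Minimal stand-in recurrence: new_state = decay * state + x.
def rnn_step(state, x, decay=0.5):
    return decay * state + x

def run_full(xs):
    """Process the whole sequence at once (training-style)."""
    state, outputs = 0.0, []
    for x in xs:
        state = rnn_step(state, x)
        outputs.append(state)
    return outputs, state

def run_streaming(chunks):
    """Process chunk by chunk, carrying the state (streaming-style)."""
    state, outputs = 0.0, []
    for chunk in chunks:
        for x in chunk:
            state = rnn_step(state, x)
            outputs.append(state)
    return outputs, state

# Both orderings produce identical outputs and final state:
xs = [1.0, 2.0, 3.0, 4.0]
assert run_full(xs) == run_streaming([[1.0, 2.0], [3.0], [4.0]])
```

A bidirectional layer cannot be split up this way, because its backward pass needs the whole utterance before producing any output.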
Well, I don’t know if it came from the blog post or from my own misreading, but I understood that the choice of a unidirectional RNN was made for when you want to do streaming.
So I took it that there were 2 versions of DeepSpeech: one with a bidirectional RNN (let’s say the standard version) and one with a unidirectional RNN (the streaming version). But no, it was an explanation of the modification to the architecture. That was my doubt.
And when I look into the code (the function rnn_impl_lstmblockfusedcell), I couldn’t tell whether it was bi- or unidirectional (I don’t know if that’s due to my lack of knowledge or to a missing comment in this function).
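For anyone with the same doubt, here is a toy illustration (not the actual code) of the difference: a fused unidirectional scan, like the one LSTMBlockFusedCell performs, runs in one direction only, so its output at time t never changes when later frames arrive. A bidirectional layer adds a backward scan over the whole sequence, so its output at time t depends on future frames, which is what rules out streaming.

```python
# Toy recurrence standing in for an LSTM scan: state = decay*state + x.
def forward_scan(xs, decay=0.5):
    state, out = 0.0, []
    for x in xs:                  # past -> future only
        state = decay * state + x
        out.append(state)
    return out

def bidirectional_scan(xs, decay=0.5):
    fwd = forward_scan(xs, decay)
    bwd = forward_scan(xs[::-1], decay)[::-1]  # needs the whole sequence
    return [f + b for f, b in zip(fwd, bwd)]

# Unidirectional: output at t=0 is unchanged when more audio arrives...
assert forward_scan([1.0, 2.0])[0] == forward_scan([1.0, 2.0, 9.0])[0]
# ...bidirectional: output at t=0 changes, because it saw the future frame.
assert bidirectional_scan([1.0, 2.0])[0] != bidirectional_scan([1.0, 2.0, 9.0])[0]
```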
My other questions were more due to my lack of knowledge of DeepSpeech. I hadn’t understood that Mozilla’s DeepSpeech has 5 layers + an output layer, just like Baidu’s. What I had understood was that Baidu’s DS has 5 layers + an output layer while Mozilla’s has 6 layers + an output layer.