Improving speech recognition accuracy with newer iterations of DeepSpeech or other competing techniques

Are you working on improving accuracy, or do you have tips for tuning the current Mozilla DeepSpeech tool to get better accuracy? Are you working on the latest iteration, DeepSpeech 3 from Baidu (https://arxiv.org/pdf/1707.07413.pdf)?


Following this.

The DeepSpeech2 architecture, which comprises

  1. convolution layers instead of hidden dense layers, and
  2. a unidirectional RNN with row convolution (future context size of 2; see the sketch below),

gives better accuracy than the DeepSpeech1 architecture.

For example, on LibriSpeech test-other, the results quoted (WER, %) are:

    DS1      DS2      Human
    21.74    13.25    12.69
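
For anyone curious what the row-convolution lookahead actually does, here is a minimal NumPy sketch (illustrative only, not code from any DeepSpeech release; the function name and shapes are assumptions for the example): each output frame is a per-feature weighted sum of the current RNN output frame and the next two frames, so a unidirectional RNN still gets a little future context without being fully bidirectional.

    import numpy as np

    # Illustrative row convolution: each output frame combines the current
    # frame and the next `taps - 1` future frames, with one weight vector
    # per tap (per-feature weights, as in the DeepSpeech2 description).
    def row_convolution(rnn_out, weights):
        # rnn_out: [time, features]; weights: [future_context + 1, features]
        time_steps, _ = rnn_out.shape
        taps = weights.shape[0]
        # Pad the end of the sequence so the last frames can still look ahead.
        padded = np.pad(rnn_out, ((0, taps - 1), (0, 0)), mode="constant")
        out = np.zeros_like(rnn_out)
        for t in range(time_steps):
            # Weighted sum over the current frame and the future-context frames.
            out[t] = (weights * padded[t:t + taps]).sum(axis=0)
        return out

    # Toy usage: 5 time steps, 3 features, future context of 2 (3 taps total).
    h = np.random.randn(5, 3)
    w = np.random.randn(3, 3)
    print(row_convolution(h, w).shape)  # (5, 3)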

Are there implementation challenges?
I wanted to know your take on this, @lissyx @kdavis.

We currently aren’t looking to switch to a DeepSpeech 3 or DeepSpeech 2 based architecture.

Also, our current architecture is no longer purely DeepSpeech based, as we've made changes to allow for streaming. In particular, we are RNN based and not BRNN based, but we are also not CNN based.

Thanks @kdavis for your reply.

I was looking into the model architecture. Yes, after 0.1.1 you switched to an RNN for streaming use cases, and now you use only a forward-direction cell, like this:

fw_cell = tf.contrib.rnn.LSTMBlockFusedCell(Config.n_cell_dim, reuse=reuse)

The DeepSpeech2 paper also mentions the challenges of BRNNs. I also looked at the 0.1.1 model architecture, where both forward and backward cells are implemented.
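
To make the contrast concrete, here is a rough TF 1.x sketch (not the actual DeepSpeech graph; the placeholder shapes, sizes and names are assumptions for the example) of the forward-only fused cell next to a bidirectional layer in the spirit of 0.1.1:

    import tensorflow as tf

    # Placeholders for this sketch only: time-major features and lengths.
    n_cell_dim = 2048
    inputs = tf.placeholder(tf.float32, [None, None, 494])  # [time, batch, features]; width is just an example
    seq_len = tf.placeholder(tf.int32, [None])               # per-utterance lengths

    # Forward-only fused cell: needs only past context, so it can stream.
    fw_cell = tf.contrib.rnn.LSTMBlockFusedCell(n_cell_dim)
    fw_out, _ = fw_cell(inputs, sequence_length=seq_len, dtype=tf.float32)

    # Bidirectional variant: the backward pass needs the whole utterance
    # before it can emit anything, which is what conflicts with streaming.
    cell_fw = tf.contrib.rnn.LSTMBlockCell(n_cell_dim)
    cell_bw = tf.contrib.rnn.LSTMBlockCell(n_cell_dim)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, sequence_length=seq_len,
        dtype=tf.float32, time_major=True)
    brnn_out = tf.concat([out_fw, out_bw], axis=2)  # both directions concatenated

The backward direction is the part that rules out streaming, which is why the change after 0.1.1 kept only the forward cell.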

I just wanted to understand: if I change the model architecture in the current DeepSpeech version, say by changing layers (RNN to BRNN) or by changing the initial dense layers to CNN, what challenges am I going to face?
Is the inference or decoding code dependent on the model architecture? Sorry if I'm asking something naive.

Use case: offline decoding with greater accuracy on conversational audio.
Kindly clarify. Thanks again.

If you only want to change RNN to BRNN, I’d suggest just using an older version of our code.

If you want to go to a CNN, you'd be changing the architecture enough that you'd be on your own. In other words, none of our models, current or old, would be usable by your code, and any bugs you'd face would likely be specific to your code.

If all you really want is to do better on conversational audio, I’d suggest simply fine tuning the current model with conversational data. That seems by far the easiest path forward for your use case.
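
For example, assuming you have downloaded the released checkpoint and prepared CSVs of your conversational data, the run could look roughly like this (paths are placeholders, and exact flag names vary a bit between DeepSpeech releases, so check the README for your version):

    # Continue training from the released checkpoint on your own CSVs
    # (placeholder paths; e.g. --epochs is --epoch in some older versions).
    python DeepSpeech.py \
        --n_hidden 2048 \
        --checkpoint_dir path/to/release/checkpoint/ \
        --train_files conversational-train.csv \
        --dev_files conversational-dev.csv \
        --test_files conversational-test.csv \
        --learning_rate 0.0001 \
        --epochs 3

Note that --n_hidden 2048 has to match the released model's layer width, and a lower learning rate helps avoid wiping out what the pre-trained model already knows.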

Thanks for your direction, @kdavis.
By fine-tuning, you mean training the current pre-trained model with conversational call recordings? Like in here:

Continuing training from a release model