The full source code for pretrained VAD streaming model


I am working on implementing streaming speech recognition on an FPGA/ASIC (basically translating C++/Python to Verilog/VHDL, plus some hardware optimization). I have decided to use the DeepSpeech pretrained VAD streaming model, so I want to know where to get its source code (the inference part only; I don't need the training part). Also, how can I test the pretrained VAD streaming model on LibriSpeech and get the WER?

Any advice would be appreciated. Thanks!

It’s all in the git repo? Have you had a look at it?

Use ?

That looks like. You really need to look at native_client/, but this depends on TensorFlow as well.

Hi, @lissyx
I know it’s all on GitHub, but I thought it would be much more efficient to ask on Discourse. I will look at those files. Thanks for your help.

Well, it would be the case if you had more precise questions. So far, all I can tell you is that everything you ask for as of now is documented. If you have more precise questions, feel free.

Hello, @lissyx
Actually, I found VAD streaming ASR in DeepSpeech-Examples, but you didn’t mention it. Should I figure it out myself?
Second, it seems like it just stacks acoustic input and then decodes it, which looks more like continuous streaming (e.g. over a fixed window) than VAD streaming? (I would rather use continuous streaming, but I thought DeepSpeech doesn’t have that.)
Lastly, I checked but still don’t know how to test it. For now I just run something like this:
deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio audio/4507-16021-0012.wav
Is there any documentation on running inference over a large dataset and obtaining the WER, instead of just a single .wav file?

Because this is not the main code, this is contributed by users.

There is no VAD code in DeepSpeech itself.

This code just makes the interface between the user and the TensorFlow runtime: it feeds the runtime with the values and then does the decoding. works mostly like this: you pass it --test_files and it will compute the WER.
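For reference, the WER that such an evaluation run reports is, by the usual definition, the word-level edit distance between reference and hypothesis divided by the number of reference words. Below is a minimal sketch in plain Python. It is not the project's own implementation, and the function names (word_edit_distance, corpus_wer) are made up for illustration.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance counted in words (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """Aggregate WER over (reference, hypothesis) transcript pairs.

    Total edits are divided by total reference words, so long utterances
    weigh more than short ones.
    """
    total_edits = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_edits += word_edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("no reference words to score against")
    return total_edits / total_ref_words
```

To score a test set, collect one (reference transcript, decoded text) pair per utterance and pass the list to corpus_wer.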

OK, so basically I can’t evaluate the streaming model on LibriSpeech right now. Is streaming just a demo without a performance test (WER)? I am worried that streaming degrades performance a lot, so I need to test it. The 0.6.0 release claimed that DeepSpeech has a streaming decoder, so I guess DeepSpeech itself has a streaming model? I am kind of confused. Correct me if I am wrong. Much appreciated!

Why? There is no “streaming” specific model, it’s the model we release.

You are confusing yourself. The claim is that the decoder is able to work with the Streaming API efficiently, which was not the case in previous releases. Now, we can call decode periodically when in the past we had to wait for the whole stream to finish to call decode.

So just pick the 0.6.1 checkpoint and run evaluation.

OK. So if I want to test the streaming WER performance, I have to build one myself. It’s just that both the encoder and the decoder are streamable. What I should do is feed data and decode it continuously. I am worried that performance would suffer a lot; would it? (I guess the 7.5% WER version computes the full sequence, although it is streamable.)

No no no no no and no. What is unclear in “there is no streaming specific model” ?

ALL the code uses the streaming API.

No no no no no and no. What is unclear in “there is no streaming specific model” ?

My thought is I have to build one because there is no streaming specific model.

ALL the code uses the streaming API.

Which means its performance hasn’t been evaluated (it’s just doable), so I need to test it.

If you mean DeepSpeech simply doesn’t support a streaming model, then how did you do it with the streaming API? Do you cut the utterance into several chunks by VAD (or a window) and feed them just like a full sequence?

The only API we have is a continuous streaming API. Everything is based on continuous streaming, every performance or accuracy number we have reported is based on continuous streaming. You don’t have to do anything special to use continuous streaming, just follow any of the examples we have, or read the API reference:

Even the convenience function we have, DS_SpeechToText, which takes an entire audio signal at once, is built on top of the continuous streaming API.
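To make the point concrete, here is a rough sketch of what "continuous streaming" looks like from the caller's side: open a stream, feed fixed-size chunks as they arrive, optionally ask for an intermediate decode, then finish the stream to get the final text. It assumes the method names of the 0.6.x Python bindings (createStream, feedAudioContent, intermediateDecode, finishStream); later releases changed this interface, so check the API reference for your version. EchoModel is a purely hypothetical stand-in so the snippet runs without a model file, and transcribe_streaming is an illustrative helper, not part of DeepSpeech.

```python
def transcribe_streaming(model, samples, chunk_size=320, decode_every=None):
    """Feed `samples` to `model` chunk by chunk and return (final_text, partials).

    `model` is assumed to expose the 0.6.x-style streaming methods.
    `decode_every=N` requests an intermediate decode after every N chunks.
    """
    ctx = model.createStream()
    partials = []
    for n, start in enumerate(range(0, len(samples), chunk_size), start=1):
        model.feedAudioContent(ctx, samples[start:start + chunk_size])
        if decode_every and n % decode_every == 0:
            partials.append(model.intermediateDecode(ctx))
    return model.finishStream(ctx), partials


class EchoModel:
    """Hypothetical stand-in: reports how many samples it has been fed."""

    def createStream(self):
        return []  # the "stream context" is just a list of samples

    def feedAudioContent(self, ctx, chunk):
        ctx.extend(chunk)

    def intermediateDecode(self, ctx):
        return f"{len(ctx)} samples"

    def finishStream(self, ctx):
        return f"{len(ctx)} samples"


if __name__ == "__main__":
    final, partials = transcribe_streaming(
        EchoModel(), list(range(1000)), chunk_size=320, decode_every=2
    )
    print(final, partials)
```

With the real bindings you would pass a loaded model and 16-bit PCM samples in place of EchoModel and the dummy list. Feeding the whole signal as one chunk and then finishing the stream is the same pattern a one-shot call follows, which is why there is no separate "streaming" model to evaluate.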


I get it. Many thanks to lissyx and reuben for the help.


I ran
./ --export_dir ../deepspeech-0.6.1-models --test_files ../../librispeech/librivox-test-clean.csv --beam_width 1 --lm_alpha 0 --lm_beta 0 --n_steps 1

and got this error:
ValueError: Cannot feed value of shape (4096,) for Tensor 'cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Initializer/Const:0', which has shape '(8192,)'

I didn’t change anything