The full source code for the pretrained VAD streaming model

Hi,

I am working on implementing streaming speech recognition on an FPGA/ASIC (basically translating the C++/Python to Verilog/VHDL, plus some hardware optimization). I have decided to use the DeepSpeech pretrained VAD streaming model, so I want to know where to get its source code (the inference part only; no need for the training part). Also, how do I test the VAD streaming pretrained model on LibriSpeech and get the WER?

Any advice would be appreciated. Thanks!

It’s all in the git repo? Have you had a look at it?

Use evaluate.py ?

That looks like it. You really need to look at native_client/deepspeech.cc, but this depends on TensorFlow as well.

Hi, @lissyx
I know it’s all on GitHub, but I thought it would be more efficient to ask on Discourse. I will look at those files. Thanks for your help.

Well, it would be the case if you had more precise questions. So far, all I can tell you is that everything you ask for as of now is documented. If you have more precise questions, feel free.

Hello, @lissyx
Actually, I found the VAD streaming ASR in DeepSpeech-examples, but you didn’t mention it. Should I figure it out on my own?
Second, from deepspeech.cc it looks like the engine just stacks acoustic input and then decodes it, which is more like continuous streaming (i.e., with a window) than VAD streaming. (I would actually prefer continuous streaming, but I thought DeepSpeech didn’t have that.)
Lastly, I checked evaluate.py but still don’t know how to run the test. Right now I just run something like this:
deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio audio/4507-16021-0012.wav
Is there any documentation on running inference over a large dataset and obtaining the WER, instead of just transcribing a single .wav file?
Thanks!

Because this is not the main code; it is contributed by users.

There is no VAD code in DeepSpeech itself.

This code just provides the interface to the user, feeds the TensorFlow runtime with the values, and then does the decoding.

evaluate.py works mostly like DeepSpeech.py: you pass it --test_files and it will compute the WER.
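For reference, the WER that evaluate.py reports is the word-level edit distance between the reference transcript and the hypothesis, divided by the reference length. Here is a minimal self-contained sketch of that metric (illustrative only; it mirrors the definition, not evaluate.py's actual implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))           # 0.0
print(round(wer("the cat sat", "the cat sit"), 3)) # 0.333
```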

OK, so basically I can’t evaluate the streaming model on LibriSpeech right now. Is streaming just a demo without a performance test (WER)? I am worried that streaming degrades the performance a lot, so I need to test it. The 0.6.0 release claimed that DeepSpeech has a streaming decoder, so I guessed DeepSpeech itself has a streaming model? I am kind of confused. Correct me if I am wrong. Much appreciated!

Why? There is no “streaming”-specific model, it’s the model we release.

You are confusing yourself. The claim is that the decoder is able to work with the Streaming API efficiently, which was not the case in previous releases. Now, we can call decode periodically when in the past we had to wait for the whole stream to finish to call decode.

So just pick the 0.6.1 checkpoint and run evaluation.

OK. So if I want to test the streaming WER performance, I have to build a setup myself. It’s just that both the encoder and decoder are streamable, and what I should do is feed data and decode it continuously. I am worried that performance would suffer a lot, would it? (I guess the 7.5% WER number is computed over the full sequence, although the model is streamable.)

No no no no no and no. What is unclear in “there is no streaming specific model” ?

ALL the code uses the streaming API.

No no no no no and no. What is unclear in “there is no streaming specific model” ?

My thought was that I have to build one myself, because there is no streaming-specific model.

ALL the code uses the streaming API.

Which would mean its performance hasn’t been evaluated (it’s merely doable), so I would need to test it.

If you mean DeepSpeech simply doesn’t have a separate streaming model, then how do you do it with the streaming API? Like cutting an utterance into several chunks by VAD (or a window) and feeding them just like a full sequence?

The only API we have is a continuous streaming API. Everything is based on continuous streaming, every performance or accuracy number we have reported is based on continuous streaming. You don’t have to do anything special to use continuous streaming, just follow any of the examples we have, or read the API reference: https://deepspeech.readthedocs.io/en/v0.6.1/C-API.html

Even the convenience function we have, DS_SpeechToText, which takes an entire audio signal at once, is built on top of the continuous streaming API.
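To make the pattern concrete, here is a hedged sketch of a continuous-streaming client in Python. The chunking helper below is the only part that actually runs; the commented-out calls (`Model`, `createStream`, `feedAudioContent`, `finishStream`) follow the 0.6.1 Python API reference linked above, but they need the released model files, so they are shown as comments rather than executed:

```python
import numpy as np

def chunks(audio: np.ndarray, sample_rate: int = 16000, chunk_ms: int = 20):
    """Yield fixed-size chunks of 16-bit PCM audio, as a streaming client would."""
    step = sample_rate * chunk_ms // 1000
    for start in range(0, len(audio), step):
        yield audio[start:start + step]

# With the real model, the continuous-streaming loop would look like
# (deepspeech 0.6.1 Python bindings; not run here, since it needs
# output_graph.pbmm from the release):
#
#   model = deepspeech.Model("output_graph.pbmm", 500)
#   stream = model.createStream()
#   for chunk in chunks(audio):
#       model.feedAudioContent(stream, chunk)
#   text = model.finishStream(stream)

audio = np.zeros(16000, dtype=np.int16)  # 1 second of silence at 16 kHz
pieces = list(chunks(audio))
print(len(pieces))  # 50 chunks of 320 samples each
```

No VAD is involved: chunks are fed as they arrive, which is exactly what the convenience function does internally with the whole buffer.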


I get it. Much appreciate for the help from lissyx and reuben.



@lissyx
I ran
./evaluate.py --export_dir ../deepspeech-0.6.1-models --test_files ../../librispeech/librivox-test-clean.csv --beam_width 1 --lm_alpha 0 --lm_beta 0 --n_steps 1

and got this error:
ValueError: Cannot feed value of shape (4096,) for Tensor 'cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Initializer/Const:0', which has shape '(8192,)'

I didn’t change anything.