How to make the testing process faster?

(jackhuang) #1

It takes a long time to test one .wav file (about 20s without the language model, about 240s with it). I wonder how to make this process faster?

(Lissyx) #2

Even 20 secs for a 10 secs audio file is suspect. Can you document your system and how you installed everything?

(jackhuang) #3

When I trained the English model, the test process was much quicker. Now I am using DeepSpeech to train a Chinese model. I didn’t change any code; I only swapped in the Chinese training data and language model (the alphabet size is about 6000, and the 4-gram model was trained at the Chinese character level on 10,600,000 Chinese sentences, about 700M). The testing process is much slower than before.

(Lissyx) #4

Thanks. This is something we have not yet worked on. It could just be fallout from the increased complexity, because you have many more characters. Maybe you should try to play with some of the language model’s parameters, like beam width?

Also, maybe try a different n-gram order?

(jackhuang) #5

Does it mean that the larger the beam width is, the more candidate transcriptions the model generates, and that this relationship is linear?

(Lissyx) #6

It means you are exploring things we have not. Besides the size of your alphabet (which you cannot do anything about, of course), one of the parameters I see that might influence speed is the beam width. You might also have to change the n-gram order used (I cannot remember what we are using right now). I seem to remember it being much faster than 5x when using a beam width of 100 (but that comes with a cost in word error rate as well).
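
To make the beam width question concrete, here is a minimal counting sketch in Python. It is not DeepSpeech's or TensorFlow's decoder: it assumes a plain CTC beam search that tries every symbol (plus the CTC blank) on every surviving beam at every timestep, and it ignores the cost of language model lookups. The function name `beam_search_expansions` is hypothetical.

```python
def beam_search_expansions(num_timesteps, alphabet_size, beam_width):
    """Upper bound on the (prefix, symbol) extensions scored by a plain CTC beam search.

    Assumption: every beam is extended with every symbol at every timestep.
    """
    symbols = alphabet_size + 1  # +1 for the CTC blank
    total = 0
    beams = 1  # the search starts from a single empty prefix
    for _ in range(num_timesteps):
        total += beams * symbols
        # After pruning, at most beam_width prefixes survive to the next step.
        beams = min(beam_width, beams * symbols)
    return total


if __name__ == "__main__":
    # Hypothetical numbers: 500 timesteps, English-sized vs. ~6000-character alphabet.
    for alphabet_size in (28, 6000):
        for beam_width in (100, 200):
            count = beam_search_expansions(500, alphabet_size, beam_width)
            print(alphabet_size, beam_width, count)
```

Under these assumptions the work per timestep is about `beam_width * (alphabet_size + 1)`, so it grows roughly linearly with the beam width and also with the alphabet size. A real decoder may behave differently, in particular once language model scoring is added per extension.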

There’s also something suspicious: 20 secs for 10 secs of audio on a GTX1080, when we measured ~2x realtime on a GTX1070 for English. I’m wondering whether that is really the best one can get for Chinese, or whether you have room for improvement as well.
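
For comparing timings like these, a small sketch of how one might compute the ratio of processing time to audio duration for a .wav file, using only the Python standard library. The helper names are hypothetical, and the ratio is defined here as processing time divided by audio duration, which is an assumption about how to read "realtime" figures rather than the project's official metric.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration (lower is faster)."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Figures quoted in this thread: 20s and 240s for a clip of about 10s.
    print(real_time_factor(20.0, 10.0))
    print(real_time_factor(240.0, 10.0))
```

Measuring the clip length with `wav_duration_seconds` instead of estimating it makes timings from different machines easier to compare.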

(Reuben Morais) #7

You could try using Baidu’s Warp-CTC. It’s specifically meant to handle large alphabet sizes well.

(Lissyx) #8

To complement this @jackhuang we have WarpCTC in master branch (it’s TensorFlow r1.4), though we stopped building Python packages. Also, build instructions have been updated to use --config=monolithic and --copt=-fvisibility=hidden, if you build the Python package you might not want that.

Using WarpCTC from libdeepspeech should not be too hard though, likely just have to use the proper headers and switch the CTC codepath to use it.

(jackhuang) #9

Did you mean that the warp_ctc function in TensorFlow and Baidu’s Warp-CTC have the same effect?

(Lissyx) #10

This is the code from Baidu, actually. There was an old TensorFlow fork with it; we’ve taken that and kept it more up to date. It may work, but there may be issues :)

(jackhuang) #11

And can the DeepSpeech native client installed by pip also use WarpCTC?

(Lissyx) #12

Not like that, you’d have to make the same kind of changes. You need to make the changes in ( and then rebuild the Python / Node / C++ packages; the change will then be picked up.

(jackhuang) #13

Would you please give me some advice on preparing the training data? I trained the language model on a Chinese corpus like “今 天 早 上 …”, in which each sentence is split at the character level (so the n-gram model is based on Chinese characters). Should the training data be like “明天中午” (not separated by spaces) or “明 天 中 午” (separated by spaces)?
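
The character-level split described in this post can be sketched in Python as follows. This only illustrates how to produce one consistent form for both the language model corpus and the transcripts; whether DeepSpeech needs spaced or unspaced transcripts is not settled here. The function name is hypothetical, and treating every non-whitespace code point (including Latin letters and digits) as its own token is an assumption.

```python
def split_into_characters(sentence):
    """Return the sentence with a single space between every non-whitespace character."""
    return " ".join(ch for ch in sentence if not ch.isspace())


if __name__ == "__main__":
    print(split_into_characters("明天中午"))
    # Already-split input is left in the same spaced form.
    print(split_into_characters("明 天 中 午"))
```

Running the same function over both the corpus and the transcripts keeps the two in the same tokenization, whichever form is chosen.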

(Lissyx) #14

I’m sorry, but what is the difference in Vietnamese between the two alternatives? Basically, you should train with text in the form you expect as output in daily use.

(jackhuang) #15

Well, I am training the Chinese model. In Chinese, words are not separated by spaces. For example, “今天上午吃早餐” is a normal Chinese sentence. In English, however, words are separated by spaces. So I wonder whether I need to separate the characters of a Chinese sentence when using DeepSpeech to train the Chinese model.
By the way, I am trying to understand the decoding process with KenLM, but I can only find a “” file. Could you please show me the original source code of that file so I can see how it works?

(Lissyx) #16

For the spacing, I don’t think it should be a problem. The decoding code is TensorFlow’s CTC, libctc_decoder_with_kenlm is a custom op to add KenLM scoring to the TensorFlow CTC Beam Decoder.

(Lissyx) #17

Here in TensorFlow source:

  • tensorflow/core/util/ctc/ctc_beam_search.h
  • tensorflow/core/kernels/

(Reuben Morais) #18

And a description of the algorithm itself is here:

(Reuben Morais) #19

And an explanation of beam search decoding:

(apertus) #21

Hi @lissyx @jackhuang, I think I’ve found the reason why the testing process takes so long.
The DeepSpeech2 paper mentions that the beam search is pruned further for Mandarin.
Here is the excerpt from the paper (Section 7.3):
“Rather than considering all characters as viable additions to the beam, we only consider the fewest number of characters whose cumulative probability is at least p.”
@lissyx would you mind giving me some advice on which class I need to modify? I’ve taken a quick look and I’m not sure whether KenLMBeamScorer is the actual class I should modify.

Thanks! And sorry for my poor English.
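
The pruning rule quoted from the DeepSpeech2 paper can be sketched in Python as below. This is only an illustration of the selection step for a single timestep, assuming `probs` is the per-symbol probability distribution at that timestep. It does not show where such a step would hook into the TensorFlow CTC beam search or KenLMBeamScorer, and the function name is hypothetical.

```python
def prune_by_cumulative_probability(probs, p):
    """Return indices of the fewest symbols whose probabilities sum to at least p.

    If the total never reaches p (for example due to rounding), all indices are kept.
    """
    # Visit symbols from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break
    return kept


if __name__ == "__main__":
    # Hypothetical distribution over a 4-symbol alphabet.
    probs = [0.05, 0.6, 0.05, 0.3]
    print(prune_by_cumulative_probability(probs, 0.85))
```

With an alphabet of about 6000 characters, most of the probability mass at a timestep usually sits on a handful of symbols, so a rule like this would extend each beam with far fewer candidates than the full alphabet. The right value of `p` would have to be tuned against accuracy.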