I tried beam_size=5000 and cutoff_top_n=2000, but the result was still only one sentence. In this case, decoding took about 3 seconds, noticeably slower than with beam_size=100.
The input audio was "6829-68769-0029.wav" from the LibriSpeech test set, and the decoding result is perfectly correct: [[(-1.4638117551803589, "it's a stock company in rich")]].
I did another experiment with “5142-33396-0002_400.wav”:
- With beam_size=5000 and cutoff_top_n=2000, the return is [[(-5.992853164672852, 'two hundred warriors feasted in his hall and folowed him to battle')]].
- With beam_size=100 and cutoff_top_n=40, the return is [[(-6.098922252655029, 'two hundred warriors feasted in his hall and folowed him to battle')]].
So the beam search itself seems to be working: the larger search space found a hypothesis with a better score (-5.99 vs. -6.10). If this really is the hypothesis with the best confidence, I would be relieved, since the beam search worked even though it returned only the single best prediction.
I was wondering whether other people correctly get several sentences with decreasing confidence from the decoder, or does everyone get only one sentence?
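For reference, this is the behavior I would expect from a CTC beam search: the decoder keeps up to beam_size prefixes alive and should be able to return all of them, sorted by decreasing log-probability, not just the top one. Below is a minimal pure-Python sketch of CTC prefix beam search over a toy (T, V) probability matrix (this is my own illustration, not the library's implementation; the `ctc_beam_search` function, the toy `probs`, and the `alphabet` are all made up for the example):

```python
import math
from collections import defaultdict

def ctc_beam_search(probs, alphabet, beam_size=4, blank=0):
    """Toy CTC prefix beam search over a (T, V) probability matrix.

    Returns up to beam_size (log_prob, text) pairs sorted by decreasing
    log-probability, i.e. several hypotheses, not just the best one.
    """
    # beams maps a prefix (tuple of label ids) -> (p_blank, p_non_blank):
    # probability of that prefix ending in blank vs. in its last label.
    beams = {(): (1.0, 0.0)}
    for t in range(len(probs)):
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for c in range(len(probs[t])):
                p = probs[t][c]
                if c == blank:
                    # blank keeps the prefix unchanged
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                elif prefix and prefix[-1] == c:
                    # repeated label: extending requires a preceding blank
                    nb_b, nb_nb = next_beams[prefix + (c,)]
                    next_beams[prefix + (c,)] = (nb_b, nb_nb + p_b * p)
                    # staying on the same label collapses into the prefix
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, sb_nb + p_nb * p)
                else:
                    nb_b, nb_nb = next_beams[prefix + (c,)]
                    next_beams[prefix + (c,)] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # prune to the beam_size most probable prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_size])
    results = [(math.log(p_b + p_nb), "".join(alphabet[i] for i in prefix))
               for prefix, (p_b, p_nb) in beams.items() if p_b + p_nb > 0]
    return sorted(results, reverse=True)

# Toy example: 2 timesteps, vocabulary ["_", "a", "b"] with "_" as blank.
probs = [[0.6, 0.4, 0.0],
         [0.5, 0.4, 0.1]]
for score, text in ctc_beam_search(probs, ["_", "a", "b"], beam_size=4):
    print(score, repr(text))
```

On this toy input the function returns four hypotheses with decreasing scores, which is why I expected more than one sentence back from the real decoder as well.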