I tried beam_size=5000 and cutoff_top_n=2000, but the result was still only one sentence. In this case, decoding took about 3 seconds, noticeably slower than with beam_size=100.
The input audio was "6829-68769-0029.wav" from the LibriSpeech test set, and the decoding result is perfectly correct: [[(-1.4638117551803589, "it's a stock company in rich")]].
I did another experiment with “5142-33396-0002_400.wav”:
- With beam_size=5000 and cutoff_top_n=2000, the return is [[(-5.992853164672852, 'two hundred warriors feasted in his hall and folowed him to battle')]].
- With beam_size=100 and cutoff_top_n=40, the return is [[(-6.098922252655029, 'two hundred warriors feasted in his hall and folowed him to battle')]].
So the beam search itself seems to be working: the larger search space found a hypothesis with a better score (-5.99 vs. -6.10). If this really is the hypothesis with the best confidence, I would be relieved, since the beam search worked even though it returned only the single best prediction.
I was wondering whether other people correctly get several sentences with decreasing confidence from the decoder, or does everyone get only one sentence?
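For reference, this is the behavior I would expect from a CTC beam search: the decoder keeps up to beam_size prefixes alive and should be able to return all of them, sorted by decreasing log-probability, not just the top one. Below is a minimal pure-Python sketch of CTC prefix beam search over a toy (T, V) probability matrix (this is my own illustration, not the library's implementation; the `ctc_beam_search` function, the toy `probs`, and the `alphabet` are all made up for the example):

```python
import math
from collections import defaultdict

def ctc_beam_search(probs, alphabet, beam_size=4, blank=0):
    """Toy CTC prefix beam search over a (T, V) probability matrix.

    Returns up to beam_size (log_prob, text) pairs sorted by decreasing
    log-probability, i.e. several hypotheses, not just the best one.
    """
    # beams maps a prefix (tuple of label ids) -> (p_blank, p_non_blank):
    # probability of that prefix ending in blank vs. in its last label.
    beams = {(): (1.0, 0.0)}
    for t in range(len(probs)):
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for c in range(len(probs[t])):
                p = probs[t][c]
                if c == blank:
                    # blank keeps the prefix unchanged
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                elif prefix and prefix[-1] == c:
                    # repeated label: extending requires a preceding blank
                    nb_b, nb_nb = next_beams[prefix + (c,)]
                    next_beams[prefix + (c,)] = (nb_b, nb_nb + p_b * p)
                    # staying on the same label collapses into the prefix
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, sb_nb + p_nb * p)
                else:
                    nb_b, nb_nb = next_beams[prefix + (c,)]
                    next_beams[prefix + (c,)] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # prune to the beam_size most probable prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_size])
    results = [(math.log(p_b + p_nb), "".join(alphabet[i] for i in prefix))
               for prefix, (p_b, p_nb) in beams.items() if p_b + p_nb > 0]
    return sorted(results, reverse=True)

# Toy example: 2 timesteps, vocabulary ["_", "a", "b"] with "_" as blank.
probs = [[0.6, 0.4, 0.0],
         [0.5, 0.4, 0.1]]
for score, text in ctc_beam_search(probs, ["_", "a", "b"], beam_size=4):
    print(score, repr(text))
```

On this toy input the function returns four hypotheses with decreasing scores, which is why I expected more than one sentence back from the real decoder as well.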