Newly generated language model doesn't perform as expected

DomainFlag · June 21, 2021, 8:08am

Hi!

I’ve generated a KenLM 4-gram language model binary file:

python scripts/generate_lm.py --input_txt data/librispeech-lm-norm.txt.gz --output_dir . --top_k 100000 --kenlm_bins ../kenlm/build/bin/ --arpa_order 4 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

Then I downloaded the DeepSpeech native client and generated the scorer package:

wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.amd64.cpu.linux.tar.xz
tar xvf native_client.*.tar.xz
../deepspeech/generate_scorer_package --alphabet src/data/vocabularies/vocabulary.txt --lm lm.binary --vocab vocab-100000.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

My decoder looks like this:

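# Import assumed; the post omits it (these names come from DeepSpeech's ds_ctcdecoder package)
from ds_ctcdecoder import Alphabet, Scorer, ctc_beam_search_decoder_batch

class BeamSearchDecoder: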

    def __init__(self, vocab_path, scorer_path, beam_size = 32):
        self.alphabet = Alphabet(vocab_path)
        self.scorer = Scorer(alphabet = self.alphabet, scorer_path = scorer_path, alpha = 0.931289039105002, beta = 1.1834137581510284)
        self.beam_size = beam_size

    def decode(self, outputs, seq_lengths):
        return ctc_beam_search_decoder_batch(probs_seq = outputs, seq_lengths = seq_lengths, alphabet = self.alphabet,
                                       scorer = self.scorer, beam_size = self.beam_size, num_processes = 1)


# Decoder
decoder = BeamSearchDecoder("data/vocabularies/chars.txt", "../kenlm.scorer")

And during inference (where outputs holds log-probabilities):

# SxBxA => BxSxA
outputs = torch.exp(outputs).cpu().transpose(0, 1)

if decoder is not None:
    res = outputs.roll(-1, -1).numpy()
    res = decoder.decode(res, bundle.features_size.squeeze(1).cpu().numpy())

My vocabulary:

classes = " 'abcdefghijklmnopqrstuvwxyz"
# The with block closes the file, so the contents are flushed to disk
with open("chars.txt", "w", encoding='utf-8') as text_file:
    text_file.write('\n'.join(list(classes)))

The blank label is at index 0 in my model, so outputs.roll(-1, -1) shifts the class axis one position to the left, which makes the blank label the last one.
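The effect of that roll on the label order can be sketched in plain Python. This is only an illustration: `roll_left` is a hypothetical helper that mimics a roll by -1 on a single frame, and the `"<blank>"` placeholder and the model's class order (blank first, then the characters of `classes`) are assumptions taken from the description above.

```python
# Class order assumed from the post: index 0 is the CTC blank,
# followed by the 28 characters of `classes`.
classes = " 'abcdefghijklmnopqrstuvwxyz"
model_labels = ["<blank>"] + list(classes)


def roll_left(row):
    """Hypothetical helper: pure-Python equivalent of rolling a 1-D sequence by -1."""
    return row[1:] + row[:1]


decoder_labels = roll_left(model_labels)

# After the roll, index i holds the i-th character of `classes`
# (line i of chars.txt) and the blank sits at the last index.
for i, label in enumerate(decoder_labels):
    print(i, repr(label))
```

If the printed order does not line up with the alphabet file line by line, the beam search decoder will map probabilities to the wrong characters.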

The results from the decoder:

Beam Search Decoder (it’s totally wrong):
[(-2384.23876953125, 'akdmrno')]

Greedy Decoder:
mark my words you’ll find him cwo strong for you i and two deep

Target:
mark my words you’ll find him too strong for you aye and too deep
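For comparison, greedy CTC decoding (take the most likely class per frame, collapse repeats, drop blanks) can be sketched as follows. This is a minimal pure-Python illustration, not the code used in the post; the function name, the list-based input, and the `"<blank>"` placeholder are assumptions, with the blank at index 0 as in the model described above.

```python
def greedy_ctc_decode(frames, labels, blank=0):
    """Greedy CTC decoding sketch.

    frames: list of per-timestep probability lists (one value per class).
    labels: labels[i] is the character for class i; labels[blank] is unused.
    """
    # Most likely class index for every frame
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frames]

    decoded = []
    previous = None
    for index in best:
        # Emit a character only when the class changes and is not the blank
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


# Tiny hypothetical example with three classes: blank, "a", "b"
labels = ["<blank>", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.80, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # a (new, separated by a blank)
    [0.10, 0.10, 0.80],  # b
    [0.10, 0.10, 0.80],  # b (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
]
print(greedy_ctc_decode(frames, labels))
```

When this kind of decoding gives a sensible transcript but the beam search output is garbage, the acoustic model is probably fine and the mismatch is more likely in the alphabet or scorer setup.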

ftyers · June 23, 2021, 8:28pm

Do the alphabets match exactly? You can also join us on Mozilla’s Matrix to get real-time help.

DomainFlag · June 26, 2021, 7:23pm

@ftyers Everything got resolved, and yes, I think the problem was with the alphabet, especially with the file encoding. I redid all the steps and everything worked well.
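A quick way to rule out this kind of problem is to inspect the alphabet file at the byte level. The sketch below is an assumption-laden illustration rather than anything from DeepSpeech: `check_alphabet_file` is a hypothetical helper, and it assumes the file should be plain UTF-8 with one label per line in the model's class order (blank excluded).

```python
import codecs


def check_alphabet_file(path, expected_chars):
    """Return a list of problems found; an empty list means the file matches."""
    with open(path, "rb") as f:
        raw = f.read()

    problems = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 byte order mark")
        raw = raw[len(codecs.BOM_UTF8):]
    if b"\r" in raw:
        problems.append("file contains carriage returns (Windows line endings)")
        raw = raw.replace(b"\r", b"")

    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"file is not valid UTF-8: {exc}")
        return problems

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore a single trailing newline
    if lines != list(expected_chars):
        problems.append(f"labels differ from the expected order: {lines!r}")
    return problems


if __name__ == "__main__":
    classes = " 'abcdefghijklmnopqrstuvwxyz"
    # Path assumed from the decoder setup earlier in the thread
    print(check_alphabet_file("data/vocabularies/chars.txt", classes))
```

It is worth running the same check on both the file passed to generate_scorer_package via --alphabet and the one given to the decoder, since the two have to agree.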