[Solved] Help with custom language model, output is gibberish

alex_cannan · April 9, 2020, 7:12pm

Hi there, I’ve been trying to build a custom language model for my specific use case since I have a lot of text data to work with. I’m doing this on release 0.6.1.

What I’ve done so far:

I’ve collected all of my transcriptions into a corpus.txt file, where the only characters are a-z, apostrophes, and whitespace (just like the example under ./data/lm/)
I’ve downloaded the native_client.tar.gz binaries for my system (macOS) from https://community-tc.services.mozilla.com/tasks/XNAo-uNoT8GU8E43P1PTzQ#artifacts and copied them under data/lm/
I’ve built a shell script to generate the lm.binary and trie files using the flags I saw in the data/lm/ example:
…/…/DeepSpeech/data/lm/lmplz --order 5 --memory 50% --prune 0 0 1 --arpa words.arpa <corpus.txt
…/…/DeepSpeech/data/lm/build_binary -a 255 -q 8 trie words.arpa lm.binary
…/…/DeepSpeech/data/lm/generate_trie alphabet.txt lm.binary trie
I had to generate the alphabet.txt myself, and this may be the problem, but I guessed it should be a txt file containing a-z, an apostrophe, and a whitespace, all on separate lines.
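
A quick sanity check for a hand-written alphabet.txt can be sketched as below. This is only a sketch under assumptions: the function names `parse_alphabet` and `chars_outside_alphabet` are made up for illustration and are not part of DeepSpeech. The file format is assumed to be one label per line, with lines starting with `#` treated as comments and `\#` standing for a literal `#`. Empty lines are deliberately kept as labels so that a "whitespace" line with no actual space in it shows up.

```python
def parse_alphabet(text):
    """Return the labels of an alphabet file, in file order.

    Assumed format: one label per line, lines starting with '#' are
    comments, and '\\#' stands for a literal '#'.
    """
    labels = []
    for line in text.split("\n"):
        if line.startswith("#"):
            continue  # comment line
        if line == "\\#":
            line = "#"  # escaped hash is a real label
        labels.append(line)
    # A file ending in a newline leaves one empty string at the end; drop it.
    if labels and labels[-1] == "":
        labels.pop()
    return labels


def chars_outside_alphabet(corpus_text, labels):
    """Return the set of corpus characters (newlines ignored) that have no label."""
    allowed = set(labels)
    return {ch for ch in corpus_text if ch != "\n" and ch not in allowed}


# Hypothetical, shortened alphabet: space, a, b, apostrophe.
labels = parse_alphabet("# comment\n \na\nb\n'\n")
print(labels)                                        # [' ', 'a', 'b', "'"]
print(chars_outside_alphabet("ab a'b\nabc", labels))  # {'c'}

# An empty line where the space should be becomes an empty label.
print("" in parse_alphabet("a\n\nb\n"))              # True
```

With real files, the text would come from something like `open("alphabet.txt", encoding="utf-8").read()`. Note that this check only covers which characters are present. It says nothing about their order.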

The output I’m getting is stuff like "a a a a a ", “snl shb”, “chrhmbdmshuhyhmf”, while the expected outputs are full paragraphs. The audio files I’m passing in are about 40 seconds each. I’m able to get good outputs using the pre-built lm.binary and trie, so my test setup seems fine.

Does anyone have any ideas what could be going wrong? Is my alphabet.txt set up correctly? Thanks in advance.

edit: the whitespace line in my alphabet.txt didn’t actually have a space in it, so I added that and ran it again. Now my output is a different kind of gibberish. Here’s a snippet: …“lad’ng’sle’epr’sar’her’sft’rtr’ken’ort’tre’mll’onur’sof’sle’p’tco’bie’stad’ynmtkhm’han’eou’rre’ire’oer’itlan’ors’lee’hab”…

lissyx · April 9, 2020, 7:19pm

You need to use the same alphabet for the acoustic model and the LM. A mismatch will likely result in this kind of broken output.
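
To illustrate what "matching" can mean here, below is a minimal sketch that compares two alphabets position by position. It assumes that each line's position in alphabet.txt acts as that character's numeric label, so the same characters in a different order would still count as a mismatch. The helper name `alphabet_mismatches` and the shortened example alphabets are hypothetical, not taken from DeepSpeech.

```python
from itertools import zip_longest


def alphabet_mismatches(model_labels, lm_labels):
    """List (index, model_label, lm_label) for every position that differs.

    A label missing from the shorter alphabet is reported as None.
    """
    return [
        (i, m, l)
        for i, (m, l) in enumerate(zip_longest(model_labels, lm_labels))
        if m != l
    ]


# Hypothetical, shortened alphabets with the same characters in a different order.
model_alphabet = [" ", "a", "b", "c", "'"]   # space first
custom_alphabet = ["a", "b", "c", "'", " "]  # space last

for index, model_label, lm_label in alphabet_mismatches(model_alphabet, custom_alphabet):
    print(index, repr(model_label), repr(lm_label))

# Identical alphabets produce no mismatches.
print(alphabet_mismatches(model_alphabet, model_alphabet))  # []
```

In this example every position differs even though both lists contain exactly the same five characters. A set-based comparison would miss that.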

alex_cannan · April 9, 2020, 7:24pm

Do you know where I could find the matching alphabet.txt for the prebuilt 0.6.1 models? I couldn’t find any when I looked.

Never mind, just found it: https://github.com/mozilla/DeepSpeech/blob/v0.6.1/data/alphabet.txt

lissyx · April 9, 2020, 7:26pm

It’s the alphabet file under data/

alex_cannan · April 9, 2020, 7:29pm

The output is much better now, thank you @lissyx !
