Hi there, I’ve been trying to build a custom language model for my specific use case since I have a lot of text data to work with. I’m doing this on release 0.6.1.
What I’ve done so far:
- I’ve collected all of my transcriptions into a corpus.txt file, where the only characters are a-z, apostrophes, and whitespace (just like the example under ./data/lm/)
- I’ve collected the native_client.tar.gz binaries for my system (macOS) from https://community-tc.services.mozilla.com/tasks/XNAo-uNoT8GU8E43P1PTzQ#artifacts and copied them under data/lm/
- I’ve built a shell script to generate the lm.binary and trie files using the flags I saw in the data/lm/ example:
…/…/DeepSpeech/data/lm/lmplz --order 5 --memory 50% --prune 0 0 1 --arpa words.arpa <corpus.txt
…/…/DeepSpeech/data/lm/build_binary -a 255 -q 8 trie words.arpa lm.binary
…/…/DeepSpeech/data/lm/generate_trie alphabet.txt lm.binary trie
- I had to generate the alphabet.txt myself, and this may be the problem, but I guessed it should be a txt file containing a-z, an apostrophe, and a whitespace, all on separate lines.
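For reference, here is a minimal sketch of the two inputs described above: a check that the corpus contains only a-z, apostrophes, and spaces, and an alphabet.txt written the way I guessed it should look. The temporary directory and the sample corpus text are made up for illustration, and the symbol order (a-z, then apostrophe, then space) is only my guess; this does not confirm that it matches what the acoustic model expects.

```shell
#!/bin/sh
# Illustrative only: the paths and sample text below are hypothetical.
workdir=$(mktemp -d)
corpus="$workdir/corpus.txt"
alphabet="$workdir/alphabet.txt"

# Stand-in corpus; replace with the real corpus.txt.
printf "hello world\nit's a test\n" > "$corpus"

# Count corpus lines containing anything other than a-z, apostrophe, or space.
# grep -c exits non-zero when the count is 0, hence the "|| true".
bad_lines=$(LC_ALL=C grep -c "[^a-z' ]" "$corpus" || true)
echo "corpus lines with unexpected characters: $bad_lines"

# Write a-z, an apostrophe, and a single space, one symbol per line.
# The order here is an assumption taken from the description in this post.
{
  for c in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    printf '%s\n' "$c"
  done
  printf "'\n"
  printf ' \n'
} > "$alphabet"

# Strip the padding that some wc implementations (e.g. on macOS) add.
alphabet_lines=$(wc -l < "$alphabet" | tr -d ' ')
echo "alphabet lines: $alphabet_lines"
```

Using printf for the space line avoids relying on an editor, which can silently drop trailing whitespace.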
The output I’m getting is stuff like “a a a a a ”, “snl shb”, “chrhmbdmshuhyhmf”, while the expected outputs are full paragraphs. The audio files I’m passing in are about 40 seconds each. I’m able to get good outputs using the pre-built lm.binary and trie, so my test setup seems fine.
Does anyone have any ideas what could be going wrong? Is my alphabet.txt set up correctly? Thanks in advance.
Edit: the whitespace line in my alphabet.txt didn’t actually have a space in it, so I added that and ran it again. Now my output is a different kind of gibberish. Here’s a snippet: …“lad’ng’sle’epr’sar’her’sft’rtr’ken’ort’tre’mll’onur’sof’sle’p’tco’bie’stad’ynmtkhm’han’eou’rre’ire’oer’itlan’ors’lee’hab”…
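Since an empty line and a line holding a single space look identical in most editors, a small check like the one below can tell them apart. This is only a sketch: the two stand-in files and the count_space_lines helper are hypothetical names for illustration, not part of DeepSpeech.

```shell
#!/bin/sh
# Illustrative only: file names and the helper function are hypothetical.
workdir=$(mktemp -d)

# Two stand-in alphabets: one ends in an empty line, one in a real space.
printf "a\nb\n\n" > "$workdir/alphabet_empty.txt"
printf "a\nb\n \n" > "$workdir/alphabet_space.txt"

# Print how many lines of the given file consist of exactly one space.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_space_lines() {
  grep -c '^ $' "$1" || true
}

empty_count=$(count_space_lines "$workdir/alphabet_empty.txt")
space_count=$(count_space_lines "$workdir/alphabet_space.txt")
echo "space-only lines: empty=$empty_count space=$space_count"
```

Running the same helper against the real alphabet.txt should report exactly one such line if the space symbol is present.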