Letter-based language model


(Mikel Penagarikano) #1

Hi,

Is it possible to use a letter-based language model (e.g. letter 5-grams) with DeepSpeech?

I tried to train a letter-based language model with the KenLM toolkit, but did not succeed. I then trained it with SRILM, which worked, but creating the trie fails, whether or not I build the binary LM first:

root@b9ba16c8d6a1:/DeepSpeech# /DeepSpeech/native_client/kenlm/build/bin/build_binary lm.arpa lm.binary
Reading lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
The ARPA file is missing <unk>. Substituting log10 probability -100.


SUCCESS

/DeepSpeech/native_client/generate_trie alphabet.txt lm.binary trie

Segmentation fault (core dumped)

/DeepSpeech/native_client/generate_trie alphabet.txt lm.arpa trie
Loading the LM will be faster if you build a binary file.
Reading lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
The ARPA file is missing <unk>. Substituting log10 probability -100.


Segmentation fault (core dumped)
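
For context, a letter-level n-gram model is normally trained on a corpus in which every character is its own whitespace-separated token. Below is a minimal Python sketch of that preprocessing step. The function name and the "|" word-boundary token are assumptions chosen for illustration, not something KenLM, SRILM, or DeepSpeech prescribes, and this does not by itself address the generate_trie crash.

```python
def to_char_tokens(line: str, boundary: str = "|") -> str:
    """Rewrite a line so that every character is a separate token.

    `boundary` is a hypothetical placeholder marking the gaps between words.
    """
    words = line.strip().lower().split()
    # Spell each word letter by letter, then join words with the boundary token.
    spelled = [" ".join(word) for word in words]
    return f" {boundary} ".join(spelled)


if __name__ == "__main__":
    print(to_char_tokens("hello world"))  # h e l l o | w o r l d
```

Running each line of the training text through such a function before calling the n-gram toolkit gives it letter tokens instead of word tokens.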


(mathematiguy) #2

Can we bump this question up the list?

I’m working on incorporating some orthographic character rules into the language model (e.g. no consonant clusters) that are specific to my language use case. My understanding is that a character-based language model might make it possible to restrict the model in those specific ways.

Or if there’s another way to achieve this goal, I could consider that as well…
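
To make the kind of rule concrete, here is a rough Python sketch of a "no two consonants in a row" check. The vowel set and the function name are assumptions for illustration only, and the check only validates candidate words; it does not hook into the DeepSpeech decoder.

```python
VOWELS = set("aeiou")  # assumption: adjust for the target orthography


def has_consonant_cluster(word: str, vowels: set = VOWELS) -> bool:
    """Return True if two alphabetic non-vowel characters appear back to back."""
    prev_is_consonant = False
    for ch in word.lower():
        is_consonant = ch.isalpha() and ch not in vowels
        if is_consonant and prev_is_consonant:
            return True
        prev_is_consonant = is_consonant
    return False
```

One possible use would be to filter a vocabulary or training corpus with this check before building the language model. Orthographies that write a single sound with two letters (digraphs) would need those pairs handled separately.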