Use Wikipedia to build language model - Issue with size of txt file


(SGang) #1

I am using DeepSpeech for Indian English speech recognition, and I’m currently using the 0.3.0 model (official release).

With the existing LM, words are not being transcribed properly, although the model is starting to pick up the dialect. I am training from the released checkpoint with India-specific voice data.

For the LM, I have downloaded the English Wikipedia dump. After cleaning the data (removing spaces, punctuation, etc.), I am left with about 14.4 GB of text in a single “txt” file. I have tried generating an LM (using KenLM) on a few 110-150 MB samples, and the resulting ARPA file is about 1.2-1.6 GB (and the binary file is about 50% of that size). Is it practical, or even possible, to use the full 14 GB text file to generate the LM? If so, would we need something like 100 GB of RAM? (The ARPA/binary file must reside in RAM while being used by DeepSpeech, right?)
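To show where my 100 GB worry comes from, here is a rough back-of-the-envelope sketch in Python. It assumes the LM size grows linearly with the amount of text, which is probably an overestimate, since the number of distinct n-grams tends to grow more slowly than the corpus. The function name `extrapolate_lm_size` is just something I made up for illustration, and pairing 110 MB with 1.2 GB and 150 MB with 1.6 GB is also my assumption, since I only noted the ranges.

```python
def extrapolate_lm_size(sample_text_mb, sample_arpa_gb, corpus_gb, binary_ratio=0.5):
    """Naive linear extrapolation of LM size (hypothetical helper).

    Assumes the ARPA size scales linearly with the text size, and that
    the binary file is `binary_ratio` times the ARPA size (about 50%
    in my small experiments).
    """
    # GB of ARPA produced per GB of input text in the small experiment
    arpa_per_text_gb = sample_arpa_gb / (sample_text_mb / 1000.0)
    arpa_gb = corpus_gb * arpa_per_text_gb
    binary_gb = arpa_gb * binary_ratio
    return arpa_gb, binary_gb


# Assumed pairing of the sample sizes and ARPA sizes observed so far
for text_mb, arpa_gb in [(110, 1.2), (150, 1.6)]:
    est_arpa, est_binary = extrapolate_lm_size(text_mb, arpa_gb, corpus_gb=14.4)
    print(f"{text_mb} MB sample -> ARPA ~{est_arpa:.0f} GB, binary ~{est_binary:.0f} GB")
```

Under that linear assumption, the full corpus would give an ARPA file well above 100 GB and a binary in the tens of GB, which is why I am asking whether this is practical.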

Kindly suggest a possible remedy and how to proceed.

P.S.: I’ll have to add about 1 GB of field-specific text in the future.