Availability of lm.binary ARPA version

william.van.woensel · January 26, 2019, 3:26pm

I was wondering whether the original ARPA version of lm.binary (deepspeech-0.4.1-models.tar.gz) is available for download.

I’m looking into integrating a task-specific language model using SRILM but that requires the original ARPA version of the model. I understand from here that the text file from which the model was generated is not available due to licensing issues - but does that preclude the textual ARPA model from being published as well? Although there seemed to be some talk about using a publicly available training data set - was LibriSpeech eventually used for training the language model?

I tried converting the binary into the ARPA version, but kenlm doesn’t seem to support that. I’ve also tried the instructions here, but I cannot (or can no longer) find “vocab.txt” in the repo.

I’ve tried some other utilities (e.g., sphinx_lm_convert) but they didn’t recognize the binary format (both for the trie and lm.binary).

kdavis · January 26, 2019, 3:45pm

This is no longer the case.

You can re-create the ARPA using the instructions here.

syoon9 · November 14, 2019, 10:02am

I have the same problem; I am looking for the .arpa file of the DeepSpeech language model. The link to the instructions above no longer works. Could you please update it?

lissyx · November 14, 2019, 10:17am

https://github.com/mozilla/DeepSpeech/blob/master/data/lm/README.rst


lm.binary was generated from the LibriSpeech normalized LM training text, available `here <http://www.openslr.org/11>`_\ , following this recipe (Jupyter notebook code):

.. code-block:: python

   import gzip
   import io
   import os

   from urllib import request

   # Grab corpus.
   url = 'http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz'
   data_upper = '/tmp/upper.txt.gz'
   request.urlretrieve(url, data_upper)

   # Convert to lowercase and cleanup.
   data_lower = '/tmp/lower.txt'
   with open(data_lower, 'w', encoding='utf-8') as lower:
       with io.TextIOWrapper(io.BufferedReader(gzip.open(data_upper)), encoding='utf8') as upper:

This file has been truncated.
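The README excerpt stops in the middle of the lowercasing step. Below is a minimal sketch of how that step could be finished and how an ARPA file could then be estimated with KenLM's ``lmplz``. The function names, the file paths, and the n-gram order of 5 are assumptions for illustration, not necessarily the settings used for the released lm.binary; the linked README remains the authoritative recipe.

```python
import gzip
import shutil
import subprocess


def lowercase_corpus(src_gz, dst_txt):
    """Stream a gzip-compressed text corpus into dst_txt, lowercasing each line.

    Returns the number of lines written. (Hypothetical helper, not from the README.)
    """
    count = 0
    with gzip.open(src_gz, 'rt', encoding='utf-8') as upper, \
            open(dst_txt, 'w', encoding='utf-8') as lower:
        for line in upper:
            lower.write(line.lower())
            count += 1
    return count


def lmplz_command(text_path, arpa_path, order=5):
    """Build an lmplz command line; the order of 5 is an assumed example value."""
    return ['lmplz', '--order', str(order), '--text', text_path, '--arpa', arpa_path]


def build_arpa(text_path, arpa_path, order=5):
    """Run lmplz to estimate an ARPA model; assumes KenLM's lmplz is on PATH."""
    if shutil.which('lmplz') is None:
        raise RuntimeError('lmplz not found on PATH; build or install KenLM first')
    subprocess.check_call(lmplz_command(text_path, arpa_path, order))


# Example usage (paths are placeholders):
#   lowercase_corpus('/tmp/upper.txt.gz', '/tmp/lower.txt')
#   build_arpa('/tmp/lower.txt', '/tmp/lm.arpa')
```

The resulting ``lm.arpa`` is a plain-text model that tools such as SRILM can read, which is what the original question needed.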

syoon9 · November 14, 2019, 10:56pm

Thank you very much for your quick reply. I really appreciate it.