Effect of a language model during training

Hi All,
I am trying to train and use a model for English from scratch on version 0.5.1. My aim is to train two models, one with and one without a language model. I would appreciate your help on several fronts. Sorry this is long, but I am trying to be as detailed as possible; also, being new to Linux and data science, I may be stating some very obvious things.
Thank you in advance for your help.

Part A) My Questions
Part B) Background info
Regards,
Rohit

Part A) My Questions

A1) When using a language model, either for training or inference, do I HAVE to specify the lm_binary parameter AND the corresponding trie file? Can using only the lm_binary or only the trie parameter work?

A2) Say I train two models on the same data. The first model is trained with an LM specified (built using the KenLM library on the vocabulary of the training transcripts, with the lm_binary and trie parameters passed in). The second model is trained without any LM parameters. Later I use each of these models for inference. Can I choose to use or not use a language model at the inference stage? Can a different language model be used during inference, or must one use the same LM used in training? Are there things to note when choosing an alternative model, e.g. training with a 3-gram model but using a 4-gram model during inference, etc.?

A3) I am facing a problem when I try to use a different LM from the one used during training. My model is trained on only 1k data points. The LM was built using the same 1k transcripts as vocabulary, and a 4-gram lm_binary and trie were specified during training.

Inference works but is understandably very poor. Console output:

(dpsp5v051basic) rohit@DE-W-0246802:~/dpspCODE/v051/DeepSpeech$ deepspeech \
--model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
--alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--lm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm \
--trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie \
--audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav
Loading model from file /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-08-01 16:11:02.155443: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-01 16:11:02.179690: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-08-01 16:11:02.179740: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:02.179756: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:02.179891: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0283s.
Loading language model from files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie
Loaded language model in 0.068s.
Running inference.
a on a in a is the
Inference took 0.449s for 3.041s audio file.

Now I want to use an LM created from a larger vocabulary file of around 600k data points (transcripts), which in this case does include the 1k wav files used as training data. This comes from the validated.tsv file of the Common Voice 2 corpus. I have double-checked that the alphabet.txt for the first 1k data points and for the larger 600k vocabulary are identical. I have also created the lm_binary and trie files (allValidated_o4gram.klm, allValidated_o4gram.trie) as 4-gram versions. Thus the basic specs of the new LM match the one used for training.
But while using the larger LM during inference I get an error saying "Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.". Is it still loading the larger LM? Did DeepSpeech actually pick it up and apply it correctly? How do I fix this error?

Console output:

(dpsp5v051basic) rohit@DE-W-0246802:~/dpspCODE/v051/DeepSpeech$ deepspeech \
--model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
--alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--lm /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm \
--trie /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie \
--audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav
Loading model from file /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-08-01 16:11:58.305524: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-01 16:11:58.322902: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-08-01 16:11:58.322945: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:58.322956: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:58.323063: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0199s.
Loading language model from files /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie
Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.
Loaded language model in 0.00368s.
Running inference.
an on o tn o as te tee
Inference took 1.893s for 3.041s audio file.

Note that the input audio is the same File28.wav, but the output transcript varies with different LMs:

a on a in a is the (smaller LM used in training and inference) vs
an on o tn o as te tee (using different larger LM for inference only)

Part B) Background:

B1) Ubuntu 18.04 LTS, no GPU, 32 GB RAM, DeepSpeech v0.5.1 git repo.

  • Downloaded the Mozilla Common Voice corpus (English) around mid-June 2019.
  • Took the validated.tsv file, did some basic transcript validation and pruned the dataset to 629731 entries.
  • Selected the first 10k entries and split them 70:20:10 as train:dev:test, creating csv files.
  • Converted the MP3s to wav files (16 kHz, mono, 16-bit), each shorter than 10 seconds (a conversion sketch follows this list).
  • Set up an Anaconda environment with DeepSpeech v0.5.1.
  • Cloned the GitHub v0.5.1 code.
  • Issued the following command in the DeepSpeech folder, which seems to be required to fetch the generate_trie executable and other required binaries:
    python util/taskcluster.py --target .
  • Installed the CTC decoder from the link obtained with:
    python util/taskcluster.py --decoder
  • Next created a vocabulary file containing only the transcripts.
  • Made no changes to any of the flags or other default parameters.
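
For reference, a minimal sketch of the MP3-to-wav conversion step above, assuming sox is installed (input.mp3 and output.wav are placeholder names):

# 16 kHz sample rate, 1 channel (mono), 16-bit samples
sox input.mp3 -r 16000 -c 1 -b 16 output.wav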

B2) Language model related:

  • Used KenLM, downloaded from its git repo and compiled. The commands used to create the 4-gram version follow.
  • Vocabulary file to arpa:

./lmplz -o 4 --text /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k.txt --arpa /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k_4gram.arpa

  • arpa to lm_binary file:

./build_binary /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k_4gram.arpa /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm

  • Used generate_trie to make the trie file:

/home/rohit/dpspCODE/v051/DeepSpeech/generate_trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/trie/trie4gram/set3First10k_4gram.trie

  • Note that the trie file was made successfully and was later used to start training.
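
As an optional sanity check, KenLM's query binary (built alongside lmplz and build_binary) can confirm that the lm_binary loads and scores text; a sketch, with an arbitrary test sentence:

echo "this is a test sentence" | ./query /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm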

B3) Commands used to start model training (training still in progress):

B3a) Model without a language model:

python3 -u DeepSpeech.py \
--train_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/train.csv \
--dev_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/dev.csv \
--test_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/test.csv \
--train_batch_size 1 \
--dev_batch_size 1 \
--test_batch_size 1 \
--n_hidden 2048 \
--epoch 20 \
--dropout_rate 0.15 \
--learning_rate 0.0001 \
--export_dir /home/rohit/dpspTraining/models/v051/model5-validFirst10k-noLM/savedModel \
--checkpoint_dir /home/rohit/dpspTraining/models/v051/model5-validFirst10k-noLM/checkpointDir \
--alphabet_config_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt \
"$@"

B3b) Model with a language model:

python3 -u DeepSpeech.py \
--train_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/train.csv \
--dev_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/dev.csv \
--test_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/test.csv \
--train_batch_size 1 \
--dev_batch_size 1 \
--test_batch_size 1 \
--n_hidden 2048 \
--epoch 20 \
--dropout_rate 0.15 \
--learning_rate 0.0001 \
--export_dir /home/rohit/dpspTraining/models/v051/model6-validFirst10k-yesLM-4gram/savedModel \
--checkpoint_dir /home/rohit/dpspTraining/models/v051/model6-validFirst10k-yesLM-4gram/checkpointDir \
--decoder_library_path /home/rohit/dpspCODE/v051/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt \
--lm_binary_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm \
--lm_trie_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/trie/trie4gram/set3First10k_4gram.trie \
"$@"

Thank you for your time! Regards.

I'm unsure I understand properly; the help clearly states those parameters take a path to those files. So you can't use just --lm_binary_path --lm_trie_path.

If you read the docs and API carefully, you will see that you can run the decoding phase without an LM, and that you can specify the files; that answers the next question.
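
For example, with the 0.5.1 client it should be enough to simply omit --lm and --trie; the decoder then runs without a language model (a sketch reusing the paths from the runs above):

deepspeech \
--model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
--alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav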

No idea; it depends on your use case.

I think you just don't have enough training data. How many hours is this 1k data points?

Honestly, I don't know what I can tell you that is not already in this error. You've likely created your trie with current master and are running it against the 0.5.1 binary. Use the proper source code.

Different LMs producing different inference results? That's by design; I don't understand your question here.

As an aside, we've had plenty of reports of issues with the use of Anaconda, so we can't guarantee this is going to work. A vanilla virtualenv should be enough.

And here is your mistake: you're downloading from master. See --help and --branch.

Thank you for replying @lissyx
I have a few follow-up questions please.


A1) When using a language model, either for training or inference, do I HAVE to specify the lm_binary parameter AND the corresponding trie file? Can using only the lm_binary or only the trie parameter work?

I'm unsure I understand properly; the help clearly states those parameters take a path to those files. So you can't use just --lm_binary_path --lm_trie_path.

Sorry if I wasn't clear. I am also extending the question to include the --decoder_library_path parameter. I am specifying the actual paths when training with an LM; please see the "B3b) Model with a language model:" section above.
I meant: can I train using only one, or any two, of these parameters? Or MUST I use all three simultaneously: --lm_binary_path, --lm_trie_path, --decoder_library_path?


I think you just don't have enough training data. How many hours is this 1k data points?

Agreed, the inference is poor as very little training data is used for now, approx. 1.25 hours. I will be using more data for training later on.


But while using the larger LM during inference I get an error saying "Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.". Is it still loading the larger LM? Did DeepSpeech actually pick it up and apply it correctly? How do I fix this error?

Honestly, I don't know what I can tell you that is not already in this error. You've likely created your trie with current master and are running it against the 0.5.1 binary. Use the proper source code.

Model training was done with the v0.5.1 code (git clone, then git checkout tags/v0.5.1).

Installed the CTC decoder as follows for v0.5.1. Did I do this correctly?

(dpsp6v051basic) rohit@DE-W-0246802:~/dpspCODE/v051again/DeepSpeech$ python util/taskcluster.py --decoder
https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.5.1.cpu-ctc/artifacts/public/ds_ctcdecoder-0.5.1-cp36-cp36m-manylinux1_x86_64.whl
(dpsp6v051basic) rohit@DE-W-0246802:~/dpspCODE/v051again/DeepSpeech$
(dpsp6v051basic) rohit@DE-W-0246802:~/dpspCODE/v051again/DeepSpeech$ pip install https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.5.1.cpu-ctc/artifacts/public/ds_ctcdecoder-0.5.1-cp36-cp36m-manylinux1_x86_64.whl
Collecting ds-ctcdecoder==0.5.1 from https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.5.1.cpu-ctc/artifacts/public/ds_ctcdecoder-0.5.1-cp36-cp36m-manylinux1_x86_64.whl
Using cached https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.5.1.cpu-ctc/artifacts/public/ds_ctcdecoder-0.5.1-cp36-cp36m-manylinux1_x86_64.whl
Requirement already satisfied: numpy>=1.7.0 in /home/rohit/anaconda3/envs/dpsp6v051basic/lib/python3.6/site-packages (from ds-ctcdecoder==0.5.1) (1.15.4)
Installing collected packages: ds-ctcdecoder
Successfully installed ds-ctcdecoder-0.5.1
(dpsp6v051basic) rohit@DE-W-0246802:~/dpspCODE/v051again/DeepSpeech$


My point is that despite the error "Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.", the inference output differs when the bigger-vocabulary LM is specified, as opposed to inference with the smaller-vocabulary LM, where there is no error. So did DeepSpeech still pick up the bigger LM despite the version mismatch and use it successfully for inference?

  • Issued the following command in the DeepSpeech folder, which seems to be required to fetch the generate_trie executable and other required binaries:
    python util/taskcluster.py --target .
  • Installed the CTC decoder from the link obtained with:
    python util/taskcluster.py --decoder

And here is your mistake: you're downloading from master. See --help and --branch.

I will explore the --branch option.
Noted on using a regular virtualenv instead of Anaconda.
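
For completeness, a sketch of what fetching version-matched binaries might look like with --branch (assuming the v0.5.1 tag is the right value here), so that generate_trie matches the 0.5.1 runtime instead of master:

python util/taskcluster.py --branch v0.5.1 --target .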

No, you can use each one independently, but they do require files. Also, --decoder_library_path has not been used for a few months now.

It looks so.

You are still comparing with training for 1 epoch on 1.25 hours of audio? It's likely just noise that has been learnt.

You are still comparing with training for 1 epoch on 1.25 hours of audio? It's likely just noise that has been learnt.

Agreed, it's extremely little data. I will try to train on more data and/or for more epochs. But right now I am exploring how to use an LM properly in training and inference, and to see its impact without too much training data.
Incidentally, the model whose inference is shown earlier was trained for 5 epochs on about 1k data points, with the training command below:

python3 -u DeepSpeech.py \
--train_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/csvFiles/train.csv \
--dev_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/csvFiles/dev.csv \
--test_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/csvFiles/test.csv \
--train_batch_size 8 \
--dev_batch_size 4 \
--test_batch_size 4 \
--n_hidden 512 \
--epoch 5 \
--dropout_rate 0.15 \
--learning_rate 0.0001 \
--export_dir /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel \
--checkpoint_dir /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/checkpointDir \
--alphabet_config_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--decoder_library_path /home/rohit/dpspCODE/v051/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
--lm_binary_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm \
--lm_trie_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie \
"$@"

I will play around a bit more and reach out for further guidance again. Thank you so much @lissyx, you guys are doing a fantastic job making this tool available to newbies like me!

You can just make a different, domain-specific LM and compare the WER on poor audio / accents.

We did that: official pre-trained English model + LM vs. official pre-trained English model + a domain-specific-only LM (a few hundred command sentences); the results were obvious.
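
For anyone wanting to reproduce that kind of comparison, a rough sketch of building such a small domain-specific LM (commands.txt, alphabet.txt and the output names are placeholders; --discount_fallback tends to be needed for very small corpora):

# Build a 3-gram arpa from the command sentences, then binarize it
./lmplz -o 3 --discount_fallback --text commands.txt --arpa commands.arpa
./build_binary commands.arpa commands.klm
# Generate the matching trie with the version-matched generate_trie
/home/rohit/dpspCODE/v051/DeepSpeech/generate_trie alphabet.txt commands.klm commands.trie

The resulting commands.klm and commands.trie can then be passed to deepspeech via --lm and --trie, as in the runs above.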

Thank you for the input.