Custom LM causes terrible false positive rate

So I switched to master and made sure all dependencies were installed. I was getting an attribute error from gast, which downgrading to 0.2.2 fixed, but now the checkpoint loader is returning None.

I downloaded the checkpoint from the 0.6.0 release.

My command is:

python DeepSpeech.py --checkpoint_dir ./model/deepspeech-0.6.0-checkpoint/ --export_tflite --export_dir ./model --lm ./model/lm.binary --trie ./model/trie

and the error:

I Exporting the model...
WARNING:tensorflow:From DeepSpeech.py:705: The name tf.nn.rnn_cell.LSTMStateTuple is deprecated. Please use tf.compat.v1.nn.rnn_cell.LSTMStateTuple instead.

W1220 11:01:21.922978 4594513344 deprecation_wrapper.py:119] From DeepSpeech.py:705: The name tf.nn.rnn_cell.LSTMStateTuple is deprecated. Please use tf.compat.v1.nn.rnn_cell.LSTMStateTuple instead.

WARNING:tensorflow:From DeepSpeech.py:131: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
W1220 11:01:22.023649 4594513344 deprecation.py:323] From DeepSpeech.py:131: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From DeepSpeech.py:141: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API
W1220 11:01:22.108879 4594513344 deprecation.py:323] From DeepSpeech.py:141: static_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell, unroll=True)`, which is equivalent to this API
WARNING:tensorflow:From /Users/mattc/anaconda3/envs/deepspeech/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W1220 11:01:22.118990 4594513344 deprecation.py:506] From /Users/mattc/anaconda3/envs/deepspeech/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /Users/mattc/anaconda3/envs/deepspeech/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py:961: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W1220 11:01:22.144685 4594513344 deprecation.py:506] From /Users/mattc/anaconda3/envs/deepspeech/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py:961: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Traceback (most recent call last):
  File "DeepSpeech.py", line 966, in <module>
    absl.app.run(main)
  File "/Users/mattc/anaconda3/envs/deepspeech/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/mattc/anaconda3/envs/deepspeech/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 947, in main
    export()
  File "DeepSpeech.py", line 805, in export
    checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'

Are you sure you have a checkpoint file in this directory? Can you verify its content?

$ ls -al eng/deepspeech-0.6.0-checkpoint/checkpoint 
lrwxrwxrwx 1 alex alex 19 déc.   6 10:41 eng/deepspeech-0.6.0-checkpoint/checkpoint -> best_dev_checkpoint
$ cat eng/deepspeech-0.6.0-checkpoint/checkpoint 
model_checkpoint_path: "best_dev-233784"
all_model_checkpoint_paths: "best_dev-233784"

You might need to create a symlink like that.
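
For context, the export step presumably resolves the checkpoint via tf.train.get_checkpoint_state, which returns None when the directory contains no file named checkpoint, hence the AttributeError above. A minimal sketch of the fix, assuming your directory contains the best_dev_checkpoint index file from the release tarball:

cd ./model/deepspeech-0.6.0-checkpoint/
ls -al                                 # best_dev_checkpoint should be listed here
ln -s best_dev_checkpoint checkpoint   # give the loader the "checkpoint" file it expects
cat checkpoint                         # should now print model_checkpoint_path: "best_dev-233784"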

That did it! Thank you. Are the --lm and --trie flags used when converting the model? I looked at the code and it didn’t seem like it.

Rerunning the GA with the new model to see what minimal WER I can get.

They are not.

Good, I hope it will help in your case. Just make sure you are using the proper master branch that contains the fix :slight_smile:

Saw your fix in the git log after pulling and switching to master so fingers crossed!

Did you tune the LM hyperparameters alpha and beta?

@reuben Can you please let me know if default_alpha and default_beta should be fine-tuned for the 0.7.0 release, or would it be OK to use --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 as specified in the docs for generating a custom LM?
If yes, do you have any docs on doing the grid search or random search?
thanks

use lm_optimizer.py
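
If you would rather do a manual search, a rough sketch follows; dev.csv is a placeholder for your own validation set, and I am assuming evaluate.py honors the --lm_alpha/--lm_beta flags from the 0.7 training flags:

# hypothetical manual grid search over the two decoder weights
for alpha in 0.50 0.75 1.00 1.25; do
  for beta in 1.0 1.5 2.0; do
    echo "trying lm_alpha=$alpha lm_beta=$beta"
    python evaluate.py --test_files dev.csv \
      --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 \
      --scorer kenlm.scorer --lm_alpha "$alpha" --lm_beta "$beta"
  done
done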

thanks @lissyx
Just wanted to confirm that I am doing this correctly:

  1. Generate the lm.binary and the vocab-1000000.txt (I am using a portion of a normalized Wikipedia dump)

python generate_lm.py --input_txt wikien.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

  2. Tune default_alpha and default_beta

python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048

Would it be OK to use the LibriSpeech test set here? My real test data would be recorded conversations that are not fully labeled.

  3. Use the generated lm.binary and vocab-1000000.txt to generate the scorer

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha <value from step 2> --default_beta <value from step 2>

Am I missing anything here?
thanks

  1. Generate the LM from your text corpus.
  2. Generate the scorer package with LM above and any default alpha and beta values.
  3. Run lm_optimizer.py with your own data, fine tuning on LibriSpeech test does nothing to help your use case.
  4. At the end, regenerate the package with the new fine tuned alpha and beta values.
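
In shell, the whole sequence looks roughly like this (a sketch only: file names are illustrative, the alpha/beta seeds in step 2 are just the documented defaults, and my_dev.csv stands for a validation set that matches your use case):

# 1. build the LM from your own text corpus
python generate_lm.py --input_txt corpus.txt --output_dir . --top_k 1000000 \
  --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
# 2. package a scorer with any initial alpha/beta
python generate_package.py --alphabet ../alphabet.txt --lm lm.binary \
  --vocab vocab-1000000.txt --package kenlm.scorer \
  --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
# 3. tune alpha/beta on data that matches your use case
python lm_optimizer.py --test_files my_dev.csv \
  --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --scorer kenlm.scorer
# 4. repackage with the best values reported in step 3
python generate_package.py --alphabet ../alphabet.txt --lm lm.binary \
  --vocab vocab-1000000.txt --package kenlm_tuned.scorer \
  --default_alpha <best lm_alpha> --default_beta <best lm_beta>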

Thanks @reuben
Is this order documented anywhere? Just wanted to make sure that I did not miss it.
Also, I noticed that lm_optimizer.py has a default value of 2400 for n_trials; should it really be run that many times? It takes about 25 min per trial on my hardware, so I am looking for ways to speed this up or run fewer trials.
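
One option I am considering (a sketch, assuming a standard training CSV with a header row; dev.csv is a placeholder for my validation set): sample a smaller CSV and pass a lower --n_trials, which the script accepts:

head -n 1 dev.csv > dev_small.csv                  # keep the CSV header
tail -n +2 dev.csv | shuf -n 500 >> dev_small.csv  # sample ~500 utterances
python lm_optimizer.py --test_files dev_small.csv \
  --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --n_trials 100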

One question on the steps listed above:
During step 3 I would use the scorer generated in step 2 (with its default alpha and beta values).

Generate lm.binary and vocab-1000000.txt

python generate_lm.py --input_txt wikien2_mod.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

Generate kenlm_mod.scorer

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

Optimize lm_alpha and lm_beta using the scorer above.

CUDA_VISIBLE_DEVICES=1 python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --scorer kenlm_mod.scorer --n_trials 100

Regenerate kenlm_mod_new.scorer

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod_new.scorer --default_alpha <value from step 3> --default_beta <value from step 3>

Thanks

lm_optimizer.py on my test data gives a WER of 0.7 for 100 trials.

Finished trial#99 with value: 0.7084736251402918 with parameters: {'lm_alpha': 0.709291280206513, 'lm_beta': 1.6728729648380825}. Best is trial#86 with value: 0.706041900486345.
Best params: lm_alpha=0.6385672670016014 and lm_beta=1.257121392283404 with WER=0.7060419004863
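
Plugging those best params back into the scorer would then look like this (the same generate_package.py invocation as above, with the tuned values from the best trial):

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary \
  --vocab vocab-1000000.txt --package kenlm_mod_new.scorer \
  --default_alpha 0.6385672670016014 --default_beta 1.257121392283404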

Is there a guideline on how many rows should be in the test CSV file and how many epochs this should be run for?

@reuben Can you please clarify your #3 – what does "your own data" mean here?
I am using normalized Wikipedia text to generate the lm.binary and vocab-1000000.txt. Does it mean I need a spoken version of this text compiled into a CSV file (as in librivox-test-clean.csv) as input to lm_optimizer.py?

Or can I just use my test data, which consists of meeting recordings compiled into a CSV file in the same format as librivox-test-clean.csv?

I meant your own data that matches your use case as the validation set used by lm_optimizer. You could just use librispeech-dev-clean.csv, but that can only go so far compared to having data that matches your use case. Note that lm_optimizer uses the --test_files parameter as input but you should actually use a validation set, not a test set, to avoid skewing your results.
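
For reference, such a validation CSV uses the standard DeepSpeech training format of wav_filename,wav_filesize,transcript; a minimal sketch (paths, sizes and transcripts are illustrative):

wav_filename,wav_filesize,transcript
/data/meetings/clip_0001.wav,320044,let's move to the next agenda item
/data/meetings/clip_0002.wav,198444,can everyone see my screen

wav_filesize is the file size in bytes, and transcripts must stay within the model's alphabet (lower case, no punctuation beyond the apostrophe for the default English one). Each lm_optimizer trial re-decodes every row, so the set's size directly drives the runtime per trial.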

Are there any manuals for the training / optimization procedure?

Is it something like the following?

  1. Run training until you get a better WER.
  2. Run lm_optimizer.py to get the best alpha and beta values.
  3. Regenerate the package with the newly tuned alpha and beta values.
  4. Run training again (with more or fewer epochs than before).
  5. Run lm_optimizer.py to get the best alpha and beta values again.

…and so on, until the result is acceptable?

AttributeError: module 'gast' has no attribute 'Index'

Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 946, in main
    export()
  File "DeepSpeech.py", line 804, in export
    checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'

Got stuck here. @lissyx, can you please help me solve this issue?

Your setup seems broken, I have no idea why and we don’t reproduce that error.

Also, please refrain from hijacking old topics with unrelated issues.

@lissyx thank you… can you please give me a step-by-step procedure to create my own model? I am new to this.

Are you new to reading and understanding what I wrote before?
We have extensive documentation, please read it.

https://deepspeech.readthedocs.io/en/latest/TRAINING.html