Custom LM causes terrible false positive rate

They are not.

Good, I hope it helps in your case. Just be careful to use the proper master that contains the fix 🙂

Saw your fix in the git log after pulling and switching to master, so fingers crossed!


Did you tune the LM hyperparameters alpha and beta?

@reuben Can you please let me know whether default_alpha and default_beta should be fine-tuned for the 0.7.0 release, or whether it would be OK to use --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 as specified in the docs for generating a custom LM?
If yes, do you have any docs on doing a grid search or random search?
thanks

use lm_optimizer.py

thanks @lissyx
Just wanted to confirm that I am doing this correctly:

  1. Generate lm.binary and vocab-1000000.txt (I am using a portion of a normalized Wikipedia dump):

python generate_lm.py --input_txt wikien.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

  2. Tune default_alpha and default_beta:

python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048

Would it be OK to use the LibriSpeech test set here? My real test data would be recorded conversations that are not fully labeled.

  3. Use the generated lm.binary and vocab-1000000.txt to generate the scorer:

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha <value from step 2> --default_beta <value from step 2>

Am I missing anything here?
thanks

  1. Generate the LM from your text corpus.
  2. Generate the scorer package with LM above and any default alpha and beta values.
  3. Run lm_optimizer.py with your own data; fine-tuning on the LibriSpeech test set does nothing to help your use case.
  4. At the end, regenerate the package with the new fine-tuned alpha and beta values (a sketch of the full sequence follows below).
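
For reference, a minimal end-to-end sketch of those four steps, reusing the commands already shown in this thread (corpus.txt, my_dev_set.csv and the scorer names are placeholders, not official names):

# 1. Build the LM from your own text corpus
python generate_lm.py --input_txt corpus.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

# 2. Package a first scorer with any starting alpha/beta
python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_custom.scorer --default_alpha 0.93 --default_beta 1.18

# 3. Search alpha/beta against a dev set that matches your use case
python lm_optimizer.py --test_files my_dev_set.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --scorer kenlm_custom.scorer

# 4. Re-package with the best values reported in step 3
python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_custom_tuned.scorer --default_alpha <best alpha> --default_beta <best beta>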

Thanks @reuben
Is this order documented anywhere? Just wanted to make sure that I did not miss it.
Also I noticed that lm_optimizer.py has a default value of 2400 for n_trials; should it be run that many times? It takes about 25 minutes per trial on my hardware, so I am looking for ways to speed this up or run fewer trials.
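
For example, I assume a shorter search can be requested explicitly via that flag (my_dev_set.csv is a placeholder for my own data):

python lm_optimizer.py --test_files my_dev_set.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --n_trials 100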

One question on the steps listed above: during step 3, should I use the scorer generated in step 2 (with the default alpha and beta values)?

Generate lm.binary and vocab-1000000.txt

python generate_lm.py --input_txt wikien2_mod.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

Generate kenlm_mod.scorer

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

Optimize lm_alpha and lm_beta using the above scorer.

CUDA_VISIBLE_DEVICES=1 python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --scorer kenlm_mod.scorer --n_trials 100

Regenerate kenlm_mod_new.scorer

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod_new.scorer --default_alpha <best alpha from step 3> --default_beta <best beta from step 3>

Thanks

lm_optimizer.py on my test data gives a WER of 0.7 after 100 trials.

Finished trial#99 with value: 0.7084736251402918 with parameters: {'lm_alpha': 0.709291280206513, 'lm_beta': 1.6728729648380825}. Best is trial#86 with value: 0.706041900486345.
Best params: lm_alpha=0.6385672670016014 and lm_beta=1.257121392283404 with WER=0.7060419004863
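
If I understand the steps correctly, those best values then go back into the final packaging command from above:

python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod_new.scorer --default_alpha 0.6385672670016014 --default_beta 1.257121392283404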

Is there a guideline on how many rows should be in the test CSV file and how many trials this should be run for?

@reuben Can you please clarify your step 3: what does 'your own data' mean here?
I am using normalized Wikipedia text to generate lm.binary and the vocab file. Does that mean I need a spoken version of this text, compiled into a CSV file (as in librivox-test-clean.csv), as input to lm_optimizer.py?

Or can I just use my test data (meeting recordings) compiled into a CSV file in the same format as librivox-test-clean.csv?

I meant your own data that matches your use case as the validation set used by lm_optimizer. You could just use librispeech-dev-clean.csv, but that can only go so far compared to having data that matches your use case. Note that lm_optimizer uses the --test_files parameter as input but you should actually use a validation set, not a test set, to avoid skewing your results.
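
For reference, such a validation CSV uses the same layout as the LibriSpeech CSVs shipped with the repo; a minimal sketch with made-up meeting clips (paths, sizes and transcripts are placeholders):

wav_filename,wav_filesize,transcript
/data/meetings/clip_0001.wav,480044,let us move on to the next agenda item
/data/meetings/clip_0002.wav,320044,the quarterly numbers look good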

Are there any manuals for the training / optimization procedure?

Would it be something like the following?

  1. Execute training until you get a better WER.
  2. Run lm_optimizer.py to get the best alpha and beta values.
  3. Regenerate the package with the new fine-tuned alpha and beta values.
  4. Execute training again (more or fewer epochs than before).
  5. Run lm_optimizer.py to get the best alpha and beta values again.

...and repeat until the result is acceptable? (rough sketch below)
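
As a rough sketch of that loop (epochs, file names, scorer names and the stopping criterion are all placeholders):

# train the acoustic model some more
python DeepSpeech.py --train_files train.csv --dev_files dev.csv --checkpoint_dir ckpt --epochs 10
# re-tune alpha/beta for the current checkpoint
python lm_optimizer.py --test_files dev.csv --checkpoint_dir ckpt --n_hidden 2048 --scorer kenlm_custom.scorer
# regenerate the scorer with the best reported values, then evaluate
python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_custom_tuned.scorer --default_alpha <best alpha> --default_beta <best beta>
# repeat until the WER is acceptable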

AttributeError: module 'gast' has no attribute 'Index'
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 946, in main
    export()
  File "DeepSpeech.py", line 804, in export
    checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'

I got stuck here. @lissyx, can you please help me solve this issue?

Your setup seems broken; I have no idea why, and we don't reproduce that error.

Also, please refrain from hijacking old topics with unrelated issues.

@lissyx thank you. Can you please give me a step-by-step procedure to create our own model? I am new to this.

Are you new to reading and understanding what I wrote before?
We have extensive documentation, please read it.

https://deepspeech.readthedocs.io/en/latest/TRAINING.html
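
A minimal training run from those docs looks roughly like this (CSV paths and the checkpoint directory are placeholders):

python DeepSpeech.py --train_files data/train.csv --dev_files data/dev.csv --test_files data/test.csv --checkpoint_dir my_checkpoints --epochs 3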

@MattC_eostar Hi, can you please tell me how you resolved this error? I am also facing the same issue.

Please stop spamming.

@lissyx yeah, I followed the same website and ended up with this error:

File "DeepSpeech.py", line 804, in export
  checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'