Saw your fix in the git log after pulling and switching to master, so fingers crossed!
Did you tune the LM hyperparameters alpha and beta?
@reuben Can you please let me know if default_alpha and default_beta should be fine-tuned for the 0.7.0 release, or would it be ok to use --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 as specified in the docs to generate a custom LM?
If yes, do you have any docs on doing the grid search or random search?
thanks
use lm_optimizer.py
thanks @lissyx
Just wanted to confirm that I am doing this correctly:
- Generate lm.binary and vocab-1000000.txt (I am using a portion of a normalized Wikipedia dump)
python generate_lm.py --input_txt wikien.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
- Tune default_alpha and default_beta
python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048
Would it be ok to use the librispeech test here? My real test data would be recorded conversations that are not fully labeled.
- Use the generated lm.binary and vocab-1000000.txt to generate the scorer
python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha <value from step 2> --default_beta <value from step 2>
Am I missing anything here?
thanks
- Generate the LM from your text corpus.
- Generate the scorer package with LM above and any default alpha and beta values.
- Run lm_optimizer.py with your own data; fine-tuning on the LibriSpeech test set does nothing to help your use case.
- At the end, regenerate the package with the new fine-tuned alpha and beta values.
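To make that order concrete, here is a condensed sketch built from the commands already in this thread; the corpus, alphabet, CSV, and checkpoint paths are placeholders, and the initial alpha/beta are just the 0.7.0 defaults from the docs:

# 1. Build the LM from your text corpus.
python generate_lm.py --input_txt corpus.txt --output_dir . --top_k 1000000 \
  --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

# 2. Package it with any starting alpha/beta.
python generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

# 3. Search for better alpha/beta on a validation set that matches your use case.
python lm_optimizer.py --test_files my_dev.csv --checkpoint_dir deepspeech-0.7.0-checkpoint \
  --n_hidden 2048 --scorer kenlm.scorer

# 4. Re-package with the best values reported in step 3.
python generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm.scorer --default_alpha <best alpha> --default_beta <best beta>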
Thanks @reuben
Is this order documented anywhere? Just wanted to make sure that I did not miss it.
Also, I noticed that lm_optimizer.py has a default value of 2400 for n_trials. Should it be run that many times? A single trial takes about 25 minutes on my hardware, so I am looking for ways to speed this up or to run fewer trials.
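One way to cut the cost, assuming the validation CSV is large, is to tune on a random subset of it and run fewer trials; the paths and row count below are only illustrative:

# Keep the header, then sample 500 random rows from the full validation CSV.
head -n 1 my_dev.csv > my_dev_small.csv
tail -n +2 my_dev.csv | shuf | head -n 500 >> my_dev_small.csv

# Run fewer trials on the smaller set.
python lm_optimizer.py --test_files my_dev_small.csv --checkpoint_dir deepspeech-0.7.0-checkpoint \
  --n_hidden 2048 --scorer kenlm_mod.scorer --n_trials 100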
One question on the steps listed above:
During step 3, would I use the default scorer generated in step 2?
Generate lm.binary and vocab-1000000.txt
python generate_lm.py --input_txt wikien2_mod.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
Generate kenlm_mod.scorer
python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
Optimize lm_alpha and lm_beta using above scorer.
CUDA_VISIBLE_DEVICES=1 python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --scorer kenlm_mod.scorer --n_trials 100
Regenerate kenlm_mod_new.scorer
python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod_new.scorer --default_alpha <best alpha from step 3> --default_beta <best beta from step 3>
Thanks
Running lm_optimizer.py on my test data gives a WER of 0.7 after 100 trials.
Finished trial#99 with value: 0.7084736251402918 with parameters: {'lm_alpha': 0.709291280206513, 'lm_beta': 1.6728729648380825}. Best is trial#86 with value: 0.706041900486345. Best params: lm_alpha=0.6385672670016014 and lm_beta=1.257121392283404 with WER=0.7060419004863
Is there a guideline on how many rows should be in the test CSV file and how many epochs this should be run for?
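For what it's worth, the best values reported by a run like the one above just get passed back to generate_package.py; for example, with the lm_alpha and lm_beta from trial 86, and the alphabet path as in the earlier commands:

python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm_mod_new.scorer --default_alpha 0.6385672670016014 --default_beta 1.257121392283404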
@reuben Can you please clarify your #3: what does "your own data" mean here?
I am using normalized Wikipedia text to generate lm.binary and vocab-1000000.txt. Does that mean I need a spoken version of this text compiled into a CSV file (as in librivox-test-clean.csv) as input to lm_optimizer.py, or can I just use my test data, which are meeting recordings compiled into a CSV file in the same format as librivox-test-clean.csv?
I meant your own data that matches your use case as the validation set used by lm_optimizer. You could just use librispeech-dev-clean.csv, but that can only go so far compared to having data that matches your use case. Note that lm_optimizer uses the --test_files parameter as input, but you should actually use a validation set, not a test set, to avoid skewing your results.
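In case the file layout is part of the question: the CSV consumed here has the same three columns as the LibriSpeech CSVs, i.e. wav_filename, wav_filesize, transcript. A minimal sketch with made-up meeting clips (paths, sizes, and transcripts are placeholders; clips are assumed to be 16 kHz mono WAV):

cat > my_dev.csv <<'EOF'
wav_filename,wav_filesize,transcript
/data/meetings/clip_0001.wav,480044,let us start with the quarterly numbers
/data/meetings/clip_0002.wav,352078,can everyone see the shared screen
EOF

python lm_optimizer.py --test_files my_dev.csv --checkpoint_dir deepspeech-0.7.0-checkpoint \
  --n_hidden 2048 --scorer kenlm_mod.scorer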
Are there any manuals for the training/optimization procedure?
Is it the same as below?
- Execute training until you get a better WER.
- Run lm_optimizer.py to get the best alpha and beta values.
- Regenerate the package with the new fine-tuned alpha and beta values.
- Execute training again (with more or fewer epochs than before).
- Run lm_optimizer.py to get the best alpha and beta values again.
- … repeating these steps until the result is acceptable?
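A hedged sketch of what one such round could look like; the DeepSpeech.py training flags and all paths here are illustrative, not a prescribed recipe:

# Train (or continue training) the acoustic model.
python DeepSpeech.py --train_files train.csv --dev_files dev.csv --test_files test.csv \
  --checkpoint_dir my-checkpoint --epochs 10 --n_hidden 2048 --scorer kenlm.scorer

# Tune alpha/beta against the current checkpoint on a matching validation set.
python lm_optimizer.py --test_files dev.csv --checkpoint_dir my-checkpoint --n_hidden 2048 \
  --scorer kenlm.scorer --n_trials 100

# Rebuild the scorer with the best values and evaluate again.
python generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm.scorer --default_alpha <best alpha> --default_beta <best beta>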
AttributeError: module 'gast' has no attribute 'Index'
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 946, in main
    export()
  File "DeepSpeech.py", line 804, in export
    checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'
Got stuck here… @lissyx, can you kindly help me solve this issue?
Your setup seems broken, I have no idea why and we don’t reproduce that error.
Also, please refrain from hijacking old topics with unrelated issues.
@lissyx thank you… Can you please give me a step-by-step procedure to create our own model? I am new to this.
Are you new to reading and understanding what I wrote before?
We have extensive documentation, please read it.
Please stop spamming.
@lissyx yeah…I followed the same website and ended up with this error
  File "DeepSpeech.py", line 804, in export
    checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'
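For context on this kind of failure (a general observation, not a diagnosis of this specific setup): the traceback shows checkpoint is None at the point where export reads checkpoint.model_checkpoint_path, which presumably means TensorFlow's tf.train.get_checkpoint_state() found no checkpoint state in the directory it was given. A quick sanity check, with the directory name purely illustrative:

# The checkpoint directory should contain a 'checkpoint' index file plus
# saved model files (e.g. best_dev-* / train-* .index and .data-* files).
ls deepspeech-0.7.0-checkpoint/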
Well, too bad for you but:
- no context, I don't know exactly what you did
- you are still hijacking an unrelated thread.
This is my last warning. We have guidance on how to ask for support; if you continue to refuse to cooperate, I will have no choice but to stop helping: my time can be used much more efficiently advancing the project than helping someone who refuses to make a minimum of effort.