Saw your fix in the git log after pulling and switching to master, so fingers crossed!
Did you tune the LM hyperparameters alpha and beta?
@reuben Can you please let me know if default_alpha and default_beta should be fine-tuned for the 0.7.0 release, or would it be ok to use --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 as specified in the docs to generate a custom LM?
If yes, do you have any docs on doing the grid search or random search?
thanks
use lm_optimizer.py
thanks @lissyx
Just wanted to confirm that I am doing this correctly:
- Generate lm.binary and vocab-1000000.txt (I am using a portion of a normalized Wikipedia dump)
python generate_lm.py --input_txt wikien.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
- Tune default_alpha and default_beta
python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048
Would it be ok to use the librispeech test here? My real test data would be recorded conversations that are not fully labeled.
- Use the generated lm.binary and vocab-1000000.txt to generate the scorer
python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha <value from step 2> --default_beta <value from step 2>
Am I missing anything here?
thanks
- Generate the LM from your text corpus.
- Generate the scorer package with LM above and any default alpha and beta values.
- Run lm_optimizer.py with your own data; fine-tuning on the LibriSpeech test set does nothing to help your use case.
- At the end, regenerate the package with the new fine-tuned alpha and beta values.
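To make that order concrete, here is a condensed sketch built from the commands already in this thread; the corpus, alphabet, CSV, and checkpoint paths are placeholders, and the initial alpha/beta are just the 0.7.0 defaults from the docs:

# 1. Build the LM from your text corpus.
python generate_lm.py --input_txt corpus.txt --output_dir . --top_k 1000000 \
  --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

# 2. Package it with any starting alpha/beta.
python generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

# 3. Search for better alpha/beta on a validation set that matches your use case.
python lm_optimizer.py --test_files my_dev.csv --checkpoint_dir deepspeech-0.7.0-checkpoint \
  --n_hidden 2048 --scorer kenlm.scorer

# 4. Re-package with the best values reported in step 3.
python generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm.scorer --default_alpha <best alpha> --default_beta <best beta>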
Thanks @reuben
Is this order documented anywhere? Just wanted to make sure that I did not miss it.
Also, I noticed that lm_optimizer.py has a default value of 2400 for n_trials. Should it be run that many times? A single trial takes about 25 minutes on my hardware, so I am looking for ways to speed this up or to run fewer trials.
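One way to cut the cost, assuming the validation CSV is large, is to tune on a random subset of it and run fewer trials; the paths and row count below are only illustrative:

# Keep the header, then sample 500 random rows from the full validation CSV.
head -n 1 my_dev.csv > my_dev_small.csv
tail -n +2 my_dev.csv | shuf | head -n 500 >> my_dev_small.csv

# Run fewer trials on the smaller set.
python lm_optimizer.py --test_files my_dev_small.csv --checkpoint_dir deepspeech-0.7.0-checkpoint \
  --n_hidden 2048 --scorer kenlm_mod.scorer --n_trials 100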
One question on the steps listed above:
During step 3, would I use the default scorer generated in step 2?
Generate lm.binary and vocab-1000000.txt
python generate_lm.py --input_txt wikien2_mod.txt --output_dir . --top_k 1000000 --kenlm_bins kenlm-master/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
Generate kenlm_mod.scorer
python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
Optimize lm_alpha and lm_beta using above scorer.
CUDA_VISIBLE_DEVICES=1 python lm_optimizer.py --test_files bin/librispeech/librivox-test-clean.csv --checkpoint_dir deepspeech-0.7.0-checkpoint --n_hidden 2048 --scorer kenlm_mod.scorer --n_trials 100
Regenerate kenlm_mod_new.scorer
python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm_mod_new.scorer --default_alpha <best alpha from step 3> --default_beta <best beta from step 3>
Thanks
Running lm_optimizer.py on my test data gives a WER of 0.7 after 100 trials.
Finished trial#99 with value: 0.7084736251402918 with parameters: {'lm_alpha': 0.709291280206513, 'lm_beta': 1.6728729648380825}. Best is trial#86 with value: 0.706041900486345. Best params: lm_alpha=0.6385672670016014 and lm_beta=1.257121392283404 with WER=0.7060419004863
Is there a guideline on how many rows should be in the test CSV file and how many epochs this should be run for?
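For what it's worth, the best values reported by a run like the one above just get passed back to generate_package.py; for example, with the lm_alpha and lm_beta from trial 86, and the alphabet path as in the earlier commands:

python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm_mod_new.scorer --default_alpha 0.6385672670016014 --default_beta 1.257121392283404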
@reuben Can you please clarify your #3: what does "your own data" mean here?
I am using normalized Wikipedia text to generate lm.binary and vocab-1000000.txt. Does that mean I need a spoken version of this text compiled into a CSV file (as in librivox-test-clean.csv) as input to lm_optimizer.py, or can I just use my test data, which are meeting recordings compiled into a CSV file in the same format as librivox-test-clean.csv?
I meant your own data that matches your use case as the validation set used by lm_optimizer. You could just use librispeech-dev-clean.csv, but that can only go so far compared to having data that matches your use case. Note that lm_optimizer uses the --test_files parameter as input, but you should actually use a validation set, not a test set, to avoid skewing your results.
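In case the file layout is part of the question: the CSV consumed here has the same three columns as the LibriSpeech CSVs, i.e. wav_filename, wav_filesize, transcript. A minimal sketch with made-up meeting clips (paths, sizes, and transcripts are placeholders; clips are assumed to be 16 kHz mono WAV):

cat > my_dev.csv <<'EOF'
wav_filename,wav_filesize,transcript
/data/meetings/clip_0001.wav,480044,let us start with the quarterly numbers
/data/meetings/clip_0002.wav,352078,can everyone see the shared screen
EOF

python lm_optimizer.py --test_files my_dev.csv --checkpoint_dir deepspeech-0.7.0-checkpoint \
  --n_hidden 2048 --scorer kenlm_mod.scorer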
Are there any manuals for the training/optimization procedure?
Is it the same as below?
- Execute training until you get a better WER.
- Run lm_optimizer.py to get the best alpha and beta values.
- Regenerate the package with the new fine-tuned alpha and beta values.
- Execute training again (with more or fewer epochs than before).
- Run lm_optimizer.py to get the best alpha and beta values again.
- … repeating these steps until the result is acceptable?
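A hedged sketch of what one such round could look like; the DeepSpeech.py training flags and all paths here are illustrative, not a prescribed recipe:

# Train (or continue training) the acoustic model.
python DeepSpeech.py --train_files train.csv --dev_files dev.csv --test_files test.csv \
  --checkpoint_dir my-checkpoint --epochs 10 --n_hidden 2048 --scorer kenlm.scorer

# Tune alpha/beta against the current checkpoint on a matching validation set.
python lm_optimizer.py --test_files dev.csv --checkpoint_dir my-checkpoint --n_hidden 2048 \
  --scorer kenlm.scorer --n_trials 100

# Rebuild the scorer with the best values and evaluate again.
python generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab-1000000.txt \
  --package kenlm.scorer --default_alpha <best alpha> --default_beta <best beta>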
AttributeError: module 'gast' has no attribute 'Index'
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/kavyasri/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 946, in main
    export()
  File "DeepSpeech.py", line 804, in export
    checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'
Got stuck here… @lissyx, can you kindly help me solve this issue?
Your setup seems broken, I have no idea why and we don’t reproduce that error.
Also, please refrain from hijacking old topics with unrelated issues.
@lissyx thank you… Can you please give me a step-by-step procedure to create our own model? I am new to this.
Are you new to reading and understanding what I wrote before?
We have extensive documentation, please read it.
Please stop spamming.
@lissyx yeah…I followed the same website and ended up with this error
  File "DeepSpeech.py", line 804, in export
    checkpoint_path = checkpoint.model_checkpoint_path
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'
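For context on this kind of failure (a general observation, not a diagnosis of this specific setup): the traceback shows checkpoint is None at the point where export reads checkpoint.model_checkpoint_path, which presumably means TensorFlow's tf.train.get_checkpoint_state() found no checkpoint state in the directory it was given. A quick sanity check, with the directory name purely illustrative:

# The checkpoint directory should contain a 'checkpoint' index file plus
# saved model files (e.g. best_dev-* / train-* .index and .data-* files).
ls deepspeech-0.7.0-checkpoint/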
Well, too bad for you but:
- no context, I don't know exactly what you did
- you are still hijacking an unrelated thread.
This is my last warning. We have guidance on how to ask for support; if you continue to refuse to cooperate, I will have no choice but to stop helping: my time can be used much more efficiently advancing the project than helping someone who refuses to make a minimum of effort.