Performance with version 0.9.3 is a lot worse than with version 0.6.1

Using Linux and deepspeech-gpu, I trained a model with 0.9.3 using the following command:

python3 DeepSpeech.py \
--alphabet_config_path data/alphabet.txt \
--beam_width 32 \
--checkpoint_dir $ckpt_dir \
--export_dir $ckpt_dir \
--scorer $scorer_path \
--n_hidden 128 \
--learning_rate 0.0001 \
--lm_alpha 0.75 \
--lm_beta 1.85 \
--train_batch_size 6 \
--dev_batch_size 6 \
--test_batch_size 6 \
--report_count 10 \
--epochs 500 \
--noearly_stop \
--noshow_progressbar \
--export_tflite \
--train_files /datasets/deepspeech_wakeword_dataset/wakeword-train.csv,\
/datasets/deepspeech_wakeword_dataset/wakeword-train-other-accents.csv,\
/datasets/deepspeech_wakeword_dataset/wakeword-train.csv,\
/datasets/india_portal_2may2019-train.csv,\
/datasets/india_portal_2to9may2019-train.csv,\
/datasets/india_portal_9to19may2019-train.csv,\
/datasets/india_portal_19to24may2019-train.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-train.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-train.csv,\
/datasets/japan_portal_3july2019-wakeword-train.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-train.csv,\
/datasets/alexa-train.csv,\
/datasets/alexa-polly-train.csv,\
/datasets/alexa-sns.csv,\
/datasets/india_portal_ww_data_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_05222020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_test.csv,\
/datasets/ww_gtts_data_google_siri/custom_train.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_train.csv,\
/datasets/ww_polly_data_google_siri/custom_test.csv \
--dev_files /datasets/deepspeech_wakeword_dataset/wakeword-dev.csv,\
/datasets/india_portal_2may2019-dev.csv,\
/datasets/india_portal_2to9may2019-dev.csv,\
/datasets/india_portal_9to19may2019-dev.csv,\
/datasets/india_portal_19to24may2019-dev.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-dev.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-dev.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-dev.csv,\
/datasets/alexa-dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv \
--test_files /datasets/alexa-sns.csv,\
/datasets/india_portal_ww_data_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05042020/custom_test.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_test.csv

I had previously trained a model with 0.6.1 using the following command, with the same train, dev and test datasets and the same hyperparameters:

python3 DeepSpeech.py \
--alphabet_config_path data/alphabet.txt \
--beam_width 32 \
--checkpoint_dir $ckpt_dir \
--export_dir $ckpt_dir \
--lm_binary_path $lm_path/lm.binary \
--lm_trie_path $lm_path/trie \
--n_hidden 128 \
--learning_rate 0.0001 \
--lm_alpha 0.75 \
--lm_beta 1.85 \
--train_batch_size 6 \
--dev_batch_size 6 \
--test_batch_size 4 \
--report_count 10 \
--epochs 500 \
--noearly_stop \
--noshow_progressbar \
--export_tflite \
--dev_files /datasets/deepspeech_wakeword_dataset/wakeword-dev.csv,\
/datasets/india_portal_2may2019-dev.csv,\
/datasets/india_portal_2to9may2019-dev.csv,\
/datasets/india_portal_9to19may2019-dev.csv,\
/datasets/india_portal_19to24may2019-dev.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-dev.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-dev.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-dev.csv,\
/datasets/alexa-dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv \
--test_files /datasets/alexa-sns.csv,\
/datasets/india_portal_ww_data_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05042020/custom_test.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_test.csv

However, the average WER across all these test datasets is 21.26% for 0.6.1 and 44.41% for 0.9.3. The text corpus used for the LM and scorer was the same in both cases.

Can you test some samples without passing a scorer argument and compare the output? You didn’t run lm_optimizer, and I don’t know how you built the scorer. Usually that shouldn’t worsen the WER this much, but who knows.
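
As a minimal sketch of what I mean (not your exact setup: the single test CSV below is just one picked from your list, and I am assuming an empty --scorer value disables the external scorer, as the 0.9.x evaluation code does), you can re-run only the test phase against your checkpoint with purely acoustic, greedy decoding:

python3 DeepSpeech.py \
  --alphabet_config_path data/alphabet.txt \
  --checkpoint_dir $ckpt_dir \
  --n_hidden 128 \
  --beam_width 32 \
  --scorer "" \
  --test_batch_size 6 \
  --test_files /datasets/india_portal_ww_data_06182020/custom_test.csv

Doing the analogous LM-free test on the 0.6.1 checkpoint would tell us whether the acoustic models themselves are comparable.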

You seem to have a pretty special setup. Can you describe your use case?

Please compare without the LM.

@rajpuneet.sandhu Also, you just re-trained using the same parameters, but some of them might need re-evaluation: LM alpha/beta, model complexity (n_hidden of 128 seems very low), beam width of 32 also looks low, and 500 epochs looks like a lot.
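
For the alpha/beta part, the usual approach on 0.9.3 is to run lm_optimizer.py against your checkpoint and a held-out CSV. Roughly like this (a sketch only: the CSV is just one of yours picked as an example, and the trial count and search bounds are placeholders you should adjust):

python3 lm_optimizer.py \
  --alphabet_config_path data/alphabet.txt \
  --checkpoint_dir $ckpt_dir \
  --scorer $scorer_path \
  --n_hidden 128 \
  --test_batch_size 6 \
  --test_files /datasets/india_portal_ww_data_06182020/custom_dev.csv \
  --n_trials 100 \
  --lm_alpha_max 5 \
  --lm_beta_max 5

The best alpha/beta it reports can then be baked into the scorer package via generate_scorer_package --default_alpha/--default_beta, and passed as --lm_alpha/--lm_beta when testing.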

Is this English? Indian? English with an Indian accent?

I ran the test without an LM or scorer and the WER is now similar: 77% for 0.6.1 and 80% for 0.9.3. This suggests the difference comes from the LM and scorer. I generated the scorer exactly as described in the documentation:

python3 /home/rsandhu/deepspeech_v091/DeepSpeech/data/lm/generate_lm.py \
  --input_txt /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/text-corpus.txt \
  --output_dir /home/rsandhu/deepspeech_v091/ww_scorer \
  --top_k 500000 --kenlm_bins /home/rsandhu/kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

/home/rsandhu/deepspeech_v091/DeepSpeech/data/lm/generate_scorer_package \
  --alphabet /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/alphabet.txt \
  --lm /home/rsandhu/deepspeech_v091/ww_scorer/lm.binary \
  --vocab /home/rsandhu/deepspeech_v091/ww_scorer/vocab-500000.txt \
  --package /home/rsandhu/deepspeech_v091/ww_scorer/kenlm.scorer \
  --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

And this is how I built the lm.binary and trie for 0.6.1:

/home/rsandhu/kenlm/build/bin/lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text text-corpus.txt \
       --arpa lm.arpa \
       --prune 0 0 0 1 --discount_fallback 1

/home/rsandhu/kenlm/build/bin/build_binary -a 255 \
              -q 8 \
              trie \
              lm.arpa \
              lm.binary

~/deepspeech_061/DeepSpeech/generate_trie \
  /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/alphabet.txt \
  /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/lm.binary \
  /home/rsandhu/ds_061_trie/trie_ww

I am using this for hotword detection. The dataset in this training experiment consists of only this data, and it’s only about 4 hours of audio. The words are English and the accents are mixed: Indian, American, British and a few more. Do you have any suggestions on how to improve the performance?

You have a special use case, and I would advise you to build your own scorer. Check the scripts and use KenLM as you did last time. Basically, get rid of all the options that aim to shrink the model, since you want to keep the maximum of the input intact. See the sketch below.
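
Concretely, for generate_lm.py that mostly means not pruning anything. A sketch reusing your paths (with such a small corpus, --top_k 500000 already keeps every word, and --discount_fallback stays because tiny corpora often break Kneser-Ney discount estimation; the only real change from your command is the prune setting):

python3 /home/rsandhu/deepspeech_v091/DeepSpeech/data/lm/generate_lm.py \
  --input_txt /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/text-corpus.txt \
  --output_dir /home/rsandhu/deepspeech_v091/ww_scorer \
  --top_k 500000 --kenlm_bins /home/rsandhu/kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|0" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

Then repackage with generate_scorer_package as you did, ideally with the alpha/beta that lm_optimizer reports for your checkpoint rather than the default_alpha/default_beta values copied from the documentation example.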