Performance with version 0.9.3 is a lot worse than with version 0.6.1

Using Linux and deepspeech-gpu, I trained a model with 0.9.3 using the following command:

python3 DeepSpeech.py \
--alphabet_config_path data/alphabet.txt \
--beam_width 32 \
--checkpoint_dir $ckpt_dir \
--export_dir $ckpt_dir \
--scorer $scorer_path \
--n_hidden 128 \
--learning_rate 0.0001 \
--lm_alpha 0.75 \
--lm_beta 1.85 \
--train_batch_size 6 \
--dev_batch_size 6 \
--test_batch_size 6 \
--report_count 10 \
--epochs 500 \
--noearly_stop \
--noshow_progressbar \
--export_tflite \
--train_files /datasets/deepspeech_wakeword_dataset/wakeword-train.csv,\
/datasets/deepspeech_wakeword_dataset/wakeword-train-other-accents.csv,\
/datasets/deepspeech_wakeword_dataset/wakeword-train.csv,\
/datasets/india_portal_2may2019-train.csv,\
/datasets/india_portal_2to9may2019-train.csv,\
/datasets/india_portal_9to19may2019-train.csv,\
/datasets/india_portal_19to24may2019-train.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-train.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-train.csv,\
/datasets/japan_portal_3july2019-wakeword-train.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-train.csv,\
/datasets/alexa-train.csv,\
/datasets/alexa-polly-train.csv,\
/datasets/alexa-sns.csv,\
/datasets/india_portal_ww_data_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_05222020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_test.csv,\
/datasets/ww_gtts_data_google_siri/custom_train.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_train.csv,\
/datasets/ww_polly_data_google_siri/custom_test.csv \
--dev_files /datasets/deepspeech_wakeword_dataset/wakeword-dev.csv,\
/datasets/india_portal_2may2019-dev.csv,\
/datasets/india_portal_2to9may2019-dev.csv,\
/datasets/india_portal_9to19may2019-dev.csv,\
/datasets/india_portal_19to24may2019-dev.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-dev.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-dev.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-dev.csv,\
/datasets/alexa-dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv \
--test_files /datasets/alexa-sns.csv,\
/datasets/india_portal_ww_data_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05042020/custom_test.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_test.csv

I had previously trained a model with 0.6.1 using the following command, with the same train, dev and test datasets and the same hyperparameters:

python3 DeepSpeech.py \
--alphabet_config_path data/alphabet.txt \
--beam_width 32 \
--checkpoint_dir $ckpt_dir \
--export_dir $ckpt_dir \
--lm_binary_path $lm_path/lm.binary \
--lm_trie_path $lm_path/trie \
--n_hidden 128 \
--learning_rate 0.0001 \
--lm_alpha 0.75 \
--lm_beta 1.85 \
--train_batch_size 6 \
--dev_batch_size 6 \
--test_batch_size 4 \
--report_count 10 \
--epochs 500 \
--noearly_stop \
--noshow_progressbar \
--export_tflite \
--dev_files /datasets/deepspeech_wakeword_dataset/wakeword-dev.csv,\
/datasets/india_portal_2may2019-dev.csv,\
/datasets/india_portal_2to9may2019-dev.csv,\
/datasets/india_portal_9to19may2019-dev.csv,\
/datasets/india_portal_19to24may2019-dev.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-dev.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-dev.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-dev.csv,\
/datasets/alexa-dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv \
--test_files /datasets/alexa-sns.csv,\
/datasets/india_portal_ww_data_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05042020/custom_test.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_dev.csv,\
/datasets/india_portal_ww_data_06182020/custom_test.csv

However, the average WER across all these test datasets is 21.26% for 0.6.1 and 44.41% for 0.9.3. The text corpus used for the LM and scorer was the same in both cases.

Can you test some samples without passing a scorer argument and compare the output? You didn’t run lm_optimizer, and I don’t know how you built the scorer. Usually that shouldn’t worsen the WER this much, but who knows.
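
As a minimal sketch of what I mean (not your exact setup: the single test CSV below is just one picked from your list, and I am assuming an empty --scorer value disables the external scorer, as the 0.9.x evaluation code does), you can re-run only the test phase against your checkpoint with purely acoustic, greedy decoding:

python3 DeepSpeech.py \
  --alphabet_config_path data/alphabet.txt \
  --checkpoint_dir $ckpt_dir \
  --n_hidden 128 \
  --beam_width 32 \
  --scorer "" \
  --test_batch_size 6 \
  --test_files /datasets/india_portal_ww_data_06182020/custom_test.csv

Doing the analogous LM-free test on the 0.6.1 checkpoint would tell us whether the acoustic models themselves are comparable.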

You seem to have a pretty special setup. Can you describe your use case?

Please compare without the LM.

@rajpuneet.sandhu Also, you just re-trained using the same parameters, but some of them might need re-evaluation: LM alpha/beta, model complexity (n_hidden of 128 seems very low), beam width of 32 also looks low, and 500 epochs looks like a lot.
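
For the alpha/beta part, the usual approach on 0.9.3 is to run lm_optimizer.py against your checkpoint and a held-out CSV. Roughly like this (a sketch only: the CSV is just one of yours picked as an example, and the trial count and search bounds are placeholders you should adjust):

python3 lm_optimizer.py \
  --alphabet_config_path data/alphabet.txt \
  --checkpoint_dir $ckpt_dir \
  --scorer $scorer_path \
  --n_hidden 128 \
  --test_batch_size 6 \
  --test_files /datasets/india_portal_ww_data_06182020/custom_dev.csv \
  --n_trials 100 \
  --lm_alpha_max 5 \
  --lm_beta_max 5

The best alpha/beta it reports can then be baked into the scorer package via generate_scorer_package --default_alpha/--default_beta, and passed as --lm_alpha/--lm_beta when testing.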

Is this English? Indian? English with an Indian accent?

I ran the test without an LM or scorer and the WER is now similar: 77% for 0.6.1 and 80% for 0.9.3. This suggests the difference comes from the LM and scorer. I generated the scorer exactly as described in the documentation:

python3 /home/rsandhu/deepspeech_v091/DeepSpeech/data/lm/generate_lm.py \
  --input_txt /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/text-corpus.txt \
  --output_dir /home/rsandhu/deepspeech_v091/ww_scorer \
  --top_k 500000 --kenlm_bins /home/rsandhu/kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

/home/rsandhu/deepspeech_v091/DeepSpeech/data/lm/generate_scorer_package \
  --alphabet /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/alphabet.txt \
  --lm /home/rsandhu/deepspeech_v091/ww_scorer/lm.binary \
  --vocab /home/rsandhu/deepspeech_v091/ww_scorer/vocab-500000.txt \
  --package /home/rsandhu/deepspeech_v091/ww_scorer/kenlm.scorer \
  --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

And this is how I built the lm.binary and trie for 0.6.1:

/home/rsandhu/kenlm/build/bin/lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text text-corpus.txt \
       --arpa lm.arpa \
       --prune 0 0 0 1 --discount_fallback 1

/home/rsandhu/kenlm/build/bin/build_binary -a 255 \
              -q 8 \
              trie \
              lm.arpa \
              lm.binary

~/deepspeech_061/DeepSpeech/generate_trie \
  /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/alphabet.txt \
  /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/lm.binary \
  /home/rsandhu/ds_061_trie/trie_ww

I am using this for hotword detection. The dataset in this training experiment consists of only this data, and it’s only about 4 hours of audio. The words are English and the accents are mixed: Indian, American, British and a few more. Do you have any suggestions on how to improve the performance?

You have a special use case, and I would advise you to build your own scorer. Check the scripts and use KenLM as you did last time. Basically, get rid of all the options that aim to shrink the model, since you want to keep the maximum of the input intact. See the sketch below.
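
Concretely, for generate_lm.py that mostly means not pruning anything. A sketch reusing your paths (with such a small corpus, --top_k 500000 already keeps every word, and --discount_fallback stays because tiny corpora often break Kneser-Ney discount estimation; the only real change from your command is the prune setting):

python3 /home/rsandhu/deepspeech_v091/DeepSpeech/data/lm/generate_lm.py \
  --input_txt /home/rsandhu/sns/sns-app-android/app/src/main/assets/ww/text-corpus.txt \
  --output_dir /home/rsandhu/deepspeech_v091/ww_scorer \
  --top_k 500000 --kenlm_bins /home/rsandhu/kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|0" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

Then repackage with generate_scorer_package as you did, ideally with the alpha/beta that lm_optimizer reports for your checkpoint rather than the default_alpha/default_beta values copied from the documentation example.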