Any feedback on this train - what happened to early stopping?

Hello everyone,

Just wondering if anyone has feedback on this training graph. Not loving the shape of it towards the end!

Kind of wondering why early stopping didn’t kick in? (You can see the exact train command used below.) Also just curious if anyone has ideas as to why it went so off base towards the end?

And finally, just checking if there is a way to restore the earlier state from the checkpoints?

The command used was:

        python3 DeepSpeech/DeepSpeech.py  \
          --train_files             '/work/waha-tuhi/models/20200605_ds.0.7.1_thm/data/mi_train.csv' \
          --dev_files               '/work/waha-tuhi/models/20200605_ds.0.7.1_thm/data/mi_dev.csv' \
          --test_files              '/work/waha-tuhi/models/20200605_ds.0.7.1_thm/data/mi_test.csv' \
          --test_output_file        '/work/waha-tuhi/models/20200605_ds.0.7.1_thm/evaluate/test_transcripts.csv' \
          --scorer_path             '/work/waha-tuhi/models/20200605_ds.0.7.1_thm/data/lm.scorer' \
          --alphabet_config_path    '/work/waha-tuhi/models/20200605_ds.0.7.1_thm/data/alphabet.txt' \
          --lm_alpha                0.75 \
          --lm_beta                 1.85 \
          --epochs                  100 \
          --train_batch_size        16 \
          --dev_batch_size          32 \
          --test_batch_size         32 \
          --learning_rate           0.0001 \
          --max_to_keep             1 \
          --dropout_rate            0.13 \
          --checkpoint_dir          /work/waha-tuhi/models/20200605_ds.0.7.1_thm/checkpoints \
          --log_level               0 \
          --show_progressbar        0 \
          --summary_dir             /work/waha-tuhi/models/20200605_ds.0.7.1_thm/summaries \
          --limit_train             0 \
          --limit_dev               0 \
          --limit_test              0 \
          --export_dir              /work/waha-tuhi/models/20200605_ds.0.7.1_thm/export \
          --checkpoint_secs         600 \
          --automatic_mixed_precision

In digging up the exact command used in order to paste it here, I think I’ve answered my own question about early stopping… it looks like, even though we have an EARLYSTOP_NSTEPS variable (set to 10), it’s not actually getting passed to the final command line that is actually running here. D’oh!

Having written this out, though, I’m still interested in any speculation as to what might have gone wrong with the training run itself…

Make sure to also specify --early_stop to enable it. The logic was changed a bit in a PR that introduced the ability to automatically reduce the LR on loss plateaus.
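If it helps, appending something along these lines to the command should turn on both behaviours. This is only a sketch from memory of the 0.7.x flag names (please verify against the --helpfull output), and the values are purely illustrative:

          --early_stop \
          --es_epochs            10 \
          --es_min_delta         0.05 \
          --reduce_lr_on_plateau \
          --plateau_epochs       10 \
          --plateau_reduction    0.1

As I understand it, --es_epochs is the number of consecutive epochs without a validation-loss improvement of at least --es_min_delta before training stops, and the plateau flags similarly multiply the learning rate by --plateau_reduction once the validation loss has stalled for --plateau_epochs epochs.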

Thank you, yes. The variable fell out of our configuration pipeline at some point - whoops.

For completeness, in case someone else sees this… with the parameters specified as follows:

          --early_stop            1
          --es_epochs             5

It did an early stop after 17 epochs.
