Why does adding the `--display_step 2` argument significantly slow down the training time?

I am training Mozilla DeepSpeech on the Common Voice data set on Ubuntu 16.04 LTS x64 with 4 Nvidia GeForce GTX 1080 GPUs, using the command:

./DeepSpeech.py --train_files data/common-voice-v1/cv-valid-train.csv \
--dev_files data/common-voice-v1/cv-valid-dev.csv  \
--test_files data/common-voice-v1/cv-valid-test.csv  \
--log_level 0 --train_batch_size 20 --train True  \
--decoder_library_path ./libctc_decoder_with_kenlm.so  \
--checkpoint_dir cv001 --export_dir cv001export  \
--summary_dir cv001summaries --summary_secs 600  \
--wer_log_pattern "GLOBAL LOG: logwer('${COMPUTE_ID}', '%s', '%s', %f)"  \
--validation_step 2 

This keeps utilization on all 4 GPUs above 80%.

However, if I add the --display_step 2 argument, training slows down significantly and utilization drops below 20% on all 4 GPUs.

This surprises me, given how --display_step is referenced in the flag definitions:

tf.app.flags.DEFINE_integer('validation_step', 0, 'number of epochs we cycle through before validating the model - a detailed progress report is dependent on "--display_step" - 0 means no validation steps')

So, from my understanding, the model should only be evaluated once every 2 epochs, which shouldn't slow down the training itself (it should just add some evaluation time once every 2 epochs).

Why does adding the --display_step 2 argument significantly slow down Mozilla DeepSpeech training time?

The documentation is incorrect: display_step only controls the creation of WER reports, not the evaluation of the model on the validation or test sets. Report generation is very costly because it runs on the CPU and has to decode every string in the validation or test set.

Thanks. display_step seems to decode all of the strings on the training set as well, correct?

It seems that display_step > 0 (enabling the WER report) causes the ops to run on the CPU only.
But if you disable it, you have no way of knowing how the training is progressing.
So how can I have both? Any ideas?

No, only the validation/test set.

The report ops run on the CPU because they don’t have GPU implementations. Everything else works normally on the GPU regardless of the value of display_step.

@reuben
When display_step = 0, training runs on the GPU and the utilization (the Volatile GPU-Util column in nvidia-smi) fluctuates between 60% and 80%.
But when I set display_step = 1, GPU memory usage looks normal while utilization stays at 0%, and from the printed log I can see that each batch takes quite a long time to process.
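
(One simple way to watch this live, assuming nvidia-smi is available:)

# Refresh the full nvidia-smi view every second; the Volatile GPU-Util
# column shows per-GPU utilization.
watch -n 1 nvidia-smi

# Or stream just the utilization counters for all GPUs.
nvidia-smi dmon -s u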

So, how can I locate the real cause of this problem?

The problem is that the GPU training code blocks while waiting for the report-creation code. You could use display_step > 1 as a workaround, or you could offload report generation to a different process/machine by copying the checkpoint and starting a test epoch on the copy (a sketch follows).
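
A minimal sketch of that offload, reusing the paths from the command above; the cv001-report directory name is hypothetical, and whether your DeepSpeech checkout accepts the --train/--test boolean flags shown here is an assumption, so adjust to your version:

# Snapshot the current checkpoint so training never writes into the copy.
cp -r cv001 cv001-report

# In a separate process (or on another machine), run only a test epoch on the
# snapshot; the main training run is then never blocked on report decoding.
# The --train/--test booleans are assumed to exist in this DeepSpeech version.
./DeepSpeech.py --test_files data/common-voice-v1/cv-valid-test.csv \
--decoder_library_path ./libctc_decoder_with_kenlm.so \
--checkpoint_dir cv001-report \
--train False --test True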
