However, if I add the --display_step 2 argument, training slows down significantly and uses less than 20% of the 4 GPUs.
This surprises me, as --display_step is described as:
tf.app.flags.DEFINE_integer('validation_step', 0, 'number of epochs we cycle through before validating the model - a detailed progress report is dependent on "display_step" - 0 means no validation steps')
so from my understanding the model should be evaluated once every 2 epochs, and therefore it shouldn't slow down training (i.e., it should just add some evaluation time once every 2 epochs).
Why does adding the --display_step 2 argument significantly slow down Mozilla DeepSpeech training time?
The documentation is incorrect: display_step only controls the creation of WER reports, not the evaluation of the model on the validation or test sets. Report generation is very costly because it runs on the CPU and needs to decode all of the strings in the validation or test set.
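To see why this stalls the GPU step, here is a minimal, hypothetical sketch (TF 1.x; this is not DeepSpeech's actual report code): when a CPU-only op is fetched in the same sess.run() as the training step, the whole step waits for the CPU work to finish.

```python
import time

import tensorflow as tf

# Hypothetical illustration of the stall: tf.py_func always runs on the
# CPU, standing in here for the WER report's string decoding. Fetching it
# together with the "training" op blocks the run until the CPU work is done.
def slow_decode(x):
    time.sleep(2.0)  # stand-in for decoding every string in the report
    return x

inp = tf.constant([1.0])
train_op = tf.square(inp)                        # stand-in for a GPU train step
report_op = tf.py_func(slow_decode, [inp], tf.float32)

with tf.Session() as sess:
    start = time.time()
    sess.run([train_op, report_op])              # GPUs sit idle while this waits
    print("step blocked for %.1fs" % (time.time() - start))
```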
It seems display_step > 0 (enabling the WER report) causes those ops to run on the CPU only.
But if you disable it, you won't know how training is progressing.
So, how can I have both? Any ideas?
The report ops run on the CPU because they don't have GPU implementations. Everything else works normally on the GPU regardless of the value of display_step.
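If you want to verify the placement yourself, here is a minimal TF 1.x sketch (the toy ops are illustrative, not DeepSpeech's graph):

```python
import tensorflow as tf

# log_device_placement prints the device each op is assigned to, so you can
# confirm which ops fall back to the CPU while the rest of the graph stays
# on the GPU. allow_soft_placement lets TF move ops that have no GPU kernel
# to the CPU instead of raising an error.
a = tf.constant([1.0, 2.0], name="a")
b = tf.constant([3.0, 4.0], name="b")
total = tf.add(a, b, name="total")

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(total))  # device assignments are logged to stderr
```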
@reuben
When display_step = 0, training runs on the GPU and the volatile GPU utilization (in nvidia-smi) fluctuates between 60% and 80%.
But when I set display_step = 1, GPU memory usage looks normal while the volatile GPU utilization is always 0%, and from the printed log I can see that it takes quite a long time to process each batch.
So, how can I locate the real cause of this problem?
The problem is that the GPU training code is blocked waiting for the report creation code. You could use display_step > 1 as a workaround, or you could offload report generation to a different process/machine by copying the checkpoint and starting a test epoch on the copy.
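A rough sketch of the offload idea. The paths are made up, and the flag names are illustrative; check which flags your DeepSpeech version actually defines before using them:

```python
import shutil
import subprocess

# Hypothetical sketch of offloading report generation: snapshot the training
# checkpoint and run a test-only epoch on the copy in a separate process, so
# the GPU training run is never blocked waiting for the report.
TRAIN_CKPT = "/data/checkpoints"
EVAL_CKPT = "/data/checkpoints_eval"

# Stable snapshot for evaluation (remove any previous copy first).
shutil.copytree(TRAIN_CKPT, EVAL_CKPT)

subprocess.check_call([
    "python", "DeepSpeech.py",
    "--checkpoint_dir=" + EVAL_CKPT,  # read the copied checkpoint
    "--train=False",                  # skip training in this process
    "--test=True",                    # only run the test epoch / WER report
])
```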