Training - test set is run on CPU with a single thread

Every time I train my model, there is a final testing step which outputs WER.

During training, “watch -n 0.5 nvidia-smi” tells me that my GPU is used. But during final testing, only the CPU is used. The result is that testing takes almost as much time as the whole training process, which is very painful.

This happens just after the “FINISHED Optimization” text, between “Starting batch…” and “Finished batch step”, so I don't think it is caused by some other CPU-heavy task. Moreover, the CPU load is even lower than during GPU training, because only a single thread seems to be used.

I am using the following command:

CUDA_VISIBLE_DEVICES=0 LD_LIBRARY_PATH=native_client/ python -u DeepSpeech.py --log_level 1 --train_files data/train/list.csv --dev_files data/dev/list.csv --test_files data/test/list.csv --checkpoint_dir ${out_dir}/checkpoints --summary_dir ${out_dir}/tensor_board --alphabet_config_path data/alphabet.txt --lm_binary_path data/LE_MONDE_full.utf8.binary_lm --lm_trie_path data/LE_MONDE_full.utf8.deep_speech_trie --use_seq_length False --validation_step 2 --n_hidden 300 --train_batch_size 100 --dev_batch_size 100 --test_batch_size 100 --epoch 40

I am new to TensorFlow, so I don’t know what could cause this behavior…

My guess is that the “final testing” part that is only using the CPU is the calculation of the WER. I agree it is annoying and slow.

This is, however, relatively painful to fix, as putting the entire WER calculation on the GPU is much harder than one would expect. We initially tried to do so, but then realized that, for our setup, it was cheaper to spend CPU hours than the person-hours required.
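
For context, the WER itself is just a word-level edit distance averaged over the test set. Here is a minimal illustrative sketch of the per-sample computation (not our actual implementation); it is plain Python running on the CPU, which is why it gets no help from the GPU:

# Minimal sketch of a word error rate (WER) computation: a word-level
# Levenshtein distance divided by the reference length.
def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / float(len(ref))

# Example: one substitution out of four reference words -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sat dawn"))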

If you have any ideas on how this could be sped up, let us know!

I had the impression that this final step does exactly the same inference processing as during training, plus the WER calculation, and that at least the inference part is run on the GPU.

But you are saying that only the WER calculation is the real bottleneck, is that right? (Just making sure I understand correctly.)

In that case, I do not think that it should take so long. I will have a look at it.

Yes, the final step does the exact same inference computation as in training plus WER.

The inference is on the GPU while the WER calculation is on the CPU.

The WER is CPU bound and slow.

If you can take a look at putting the WER calculation on the GPU, it would be much appreciated.

An easy improvement would probably be to make this loop multithreaded on the CPU.

Yes, I have a version that does that with multiprocessing.Pool here: https://github.com/mozilla/DeepSpeech/blob/b9993aef8cb645d4377bc46ec999d15a9f5a0596/evaluate.py#L144-L170

It still isn't clear to me what we'll do with the WER calculation code in DeepSpeech.py once that branch is merged; probably remove it and call evaluate.py instead.

The problem with applying that same idea directly to DeepSpeech.py as it exists on master today is that calculate_report gets called per job (batch), not once for the entire test set, so each call only has a few inputs and you lose a lot of time just dispatching jobs and collecting results hundreds of times.
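
For illustration, here is a minimal sketch of the pooled idea applied once over the whole test set rather than per batch. The names (wer_for_pair, mean_wer, the list of (reference, transcription) pairs) are illustrative, not the actual evaluate.py code, and word_error_rate is the per-sample sketch from earlier in this thread:

from multiprocessing import Pool

# pairs is assumed to be a list of (reference, transcription) strings
# covering the *entire* test set.
def wer_for_pair(pair):
    reference, transcription = pair
    return word_error_rate(reference, transcription)

def mean_wer(pairs, processes=8):
    # Dispatch once over the whole test set instead of once per batch,
    # so the Pool overhead is only paid a single time.
    pool = Pool(processes=processes)
    try:
        scores = pool.map(wer_for_pair, pairs, chunksize=64)
    finally:
        pool.close()
        pool.join()
    return sum(scores) / len(scores)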

@reuben, I actually found out that the most time-consuming part of testing for me is LM decoding.
It takes 8 seconds to compute a batch of size 32, but if I reduce beam_width from 1024 to 10, the same batch takes 0.2 seconds. During batch computation I see that 2 CPUs (because I have 2 towers) sit at 100%, while GPU usage is 0.

Is beam search with LM supposed to be running on GPU?
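
To show how strongly beam_width alone drives the decode time, here is a rough TF 1.x-style timing sketch on random logits. It uses TensorFlow's stock tf.nn.ctc_beam_search_decoder rather than the KenLM-augmented custom op from this thread, so the absolute numbers are only indicative:

import time
import numpy as np
import tensorflow as tf

# Fake logits shaped [max_time, batch_size, num_classes] (28 labels + blank),
# just to time the CPU beam-search decoder itself.
max_time, batch_size, num_classes = 200, 32, 29
logits = tf.placeholder(tf.float32, [max_time, batch_size, num_classes])
seq_len = tf.placeholder(tf.int32, [batch_size])

fake_logits = np.random.randn(max_time, batch_size, num_classes).astype(np.float32)
fake_len = np.full(batch_size, max_time, dtype=np.int32)

with tf.Session() as sess:
    for beam_width in (10, 1024):
        decoded, _ = tf.nn.ctc_beam_search_decoder(
            logits, seq_len, beam_width=beam_width, top_paths=1)
        start = time.time()
        sess.run(decoded, {logits: fake_logits, seq_len: fake_len})
        print("beam_width=%4d: %.2f s" % (beam_width, time.time() - start))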

Also, maybe this is a problem only on my side, because I built DeepSpeech from source:

RUN bazel build --config=opt --config=cuda -c opt --copt=-O3 //native_client:libctc_decoder_with_kenlm.so  --verbose_failures --action_env=LD_LIBRARY_PATH=${LD_LIBRARY_PATH}
RUN bazel build --config=monolithic -c opt --copt=-O3 --copt=-fvisibility=hidden //native_client:libdeepspeech.so //native_client:deepspeech_utils //native_client:generate_trie --verbose_failures --action_env=LD_LIBRARY_PATH=${LD_LIBRARY_PATH}
RUN bazel build --config=opt --config=cuda  --copt=-msse4.2 //tensorflow/tools/pip_package:build_pip_package --verbose_failures --action_env=LD_LIBRARY_PATH=${LD_LIBRARY_PATH}

As far as I can tell, KenLM is pure CPU, with no GPU code. If you build from source, you might be able to experiment with more than --copt=-O3 for inference, but that won't help much during training. I'd advise relying on upstream TensorFlow packages for training.

It seems like you wrote a Dockerfile; want to contribute that? I'd be happy to review your PR :slight_smile:

I see, and it's not multithreaded, just like the TF version (#17136).

I will submit the Dockerfile after I clean it up.

Thanks! We’ll have to figure out how to properly test that as well on TaskCluster, but we should be able to find a way :slight_smile:

Please don't hijack a 2-year-old thread. Delete this one and start a new post.