Only one CPU core used when computing a batch for the GPU

Is it possible to use more threads when executing this line?
https://github.com/mozilla/DeepSpeech/blob/master/DeepSpeech.py#L1637

I am using a Tesla P100 GPU and a 56-core CPU. But with batch size > 1, the CPU (using only one core) takes more time than the GPU, so the GPU sits idle.

That line just dispatches work to be executed in several threads and the GPU. It should not do any meaningful work on the main thread. It’s possible that your training process is bottlenecked by disk IO, waiting for the importers to load and pre-process the WAV files.

Thanks for the suggestion about a possible bottleneck. But even with batch size 12, htop shows only one core loaded at 100% for several seconds. Can disk I/O load the CPU like that during those seconds?

Could it also be caused by TF not being compiled for this specific CPU? It gives this warning:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

Solved it!
My problem was the parameter --display_step 1, which had somehow stuck in my commands.
Setting it to 0 made the training process 50 times faster.
This parameter makes the program compute the Word Error Rate (WER) on each step. It's slow and runs on only one core.
Related issue on GitHub: https://github.com/mozilla/DeepSpeech/issues/776
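
For anyone wondering why a per-step WER report can be this costly, here is a minimal sketch of a word-level WER computation. It is not DeepSpeech's actual code; the function name word_error_rate and the plain whitespace tokenization are assumptions for illustration. It shows that WER is an edit-distance calculation done in ordinary Python on the CPU, so running it on every step for every sample adds single-threaded work between GPU batches.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration, not the DeepSpeech implementation.
    """
    ref = reference.split()
    hyp = hypothesis.split()

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    # Guard against an empty reference to avoid division by zero.
    return prev[-1] / max(len(ref), 1)
```

The nested loop costs roughly (reference words) x (hypothesis words) operations per transcript pair, all on one core, which is consistent with the single busy core seen in htop.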