That line just dispatches work to be executed across several threads and on the GPU. It should not do any meaningful work on the main thread. It's possible that your training process is bottlenecked by disk I/O, waiting for the importers to load and pre-process the WAV files.
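One quick way to check for that is to watch the disk while training runs. A sketch using the standard sysstat tools (not part of DeepSpeech; device names and output columns depend on your system):

```sh
# Per-device stats, refreshed every second: a consistently high %util
# or await column suggests the disk is saturated.
iostat -x 1

# Per-process read/write rates, refreshed every second: look for the
# python process running DeepSpeech in the kB_rd/s column.
pidstat -d 1
```

If the training process shows heavy reads while the cores sit mostly idle, disk I/O is a likely culprit.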
Thanks for the possible-bottleneck suggestion. But even with batch size 12, htop shows only one core loaded at 100% for several seconds at a time. Could disk I/O be what loads the CPU during those seconds?
Solved it!
My problem was the parameter --display_step 1, which had somehow stuck around in my commands.
Setting it to 0 made the training process 50 times faster.
This parameter makes the program compute the Word Error Rate (WER) at every step. That computation is slow and runs on only one core.
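For reference, the fixed invocation looks roughly like this (a sketch: the file paths and other flags are illustrative placeholders for my setup; the actual fix is only --display_step 0):

```sh
# Hypothetical DeepSpeech training command; paths and most flags are
# placeholders. --display_step 0 disables the per-step WER computation
# that was serializing training onto a single core.
python DeepSpeech.py \
    --train_files data/train.csv \
    --dev_files data/dev.csv \
    --test_files data/test.csv \
    --train_batch_size 12 \
    --display_step 0
```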
Related issue on GitHub: https://github.com/mozilla/DeepSpeech/issues/776