Optimization of DeepSpeech training in a multi-GPU environment

Dear Deepspeech team,

I need some advice on optimizing DeepSpeech training in a multi-GPU environment. I tried to find answers in the various posts and in your deepspeech-playbook, without success.

Here is the configuration of my server:

  • OS: Ubuntu 20.04.2 LTS
  • CPU: 2 x Intel® Xeon® Silver 4114 CPU @ 2.20GHz
  • GPU: 2 x NVIDIA Quadro P5000 16 GB
  • RAM: 64 GB
  • ROM: 512 GB SSD

About Drivers and TF:

  • Driver Version: 460.32.03
  • CUDA Version: 11.2
  • Tensorflow: v1.15.3-68-gdf8c55c 1.15.4
  • Python: 3.6.9

Here is the command I use to start the training on DeepSpeech 0.9.3:

python3 ./DeepSpeech.py \
    --train_cudnn True \
    --train_files $d/clips/train.csv \
    --dev_files $d/clips/dev.csv \
    --test_files $d/clips/test.csv \
    --audio_sample_rate 32000 \
    --epochs 1 \
    --summary_dir $FOLDER/summaries/ \
    --checkpoint_dir $FOLDER/checkpoints/ \
    --n_hidden 1024 \
    --export_dir $FOLDER/model/ \
    2>&1 | tee $FOLDER/training.log

The point is that:

  • When I train DeepSpeech on this multi-GPU environment, training lasts 7m02s
  • When I train DeepSpeech on a single GPU (setting the CUDA_VISIBLE_DEVICES environment variable to 0 or 1 before calling python, as shown below), training lasts 8m48s
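
For completeness, the single-GPU runs use the exact same command as above; the only difference is restricting which device the process can see before launching it:

# Expose only the first GPU (index 0) to the training process; use 1 for the second one.
# The DeepSpeech.py invocation itself is unchanged from the full command above.
export CUDA_VISIBLE_DEVICES=0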

This means that the 2-GPU setup only cuts training time by about 20% compared to a single GPU, whereas I’ve read in a post that it should in theory be close to twice as fast. Moreover, according to the nvidia-smi output:

  • during single-GPU training, the GPU is used 100% of the time
  • during 2-GPU training, GPU usage oscillates between 30% and 60%.

Note, in case it matters: the same python process (same PID) uses both GPUs.

Could you give me some advice on finding the bottleneck in my configuration? Or is there a way to improve multi-GPU training by tuning the training parameters?

Many thanks in advance,

Fabien.

You don’t mention your dataset. From the naming I assume this is Common Voice?

Why reduce the model size to 1024?

Also, there’s no batch size set, so you are defaulting to 1, which is very much not optimal.
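
For example, something along these lines added to your existing command (the values are only placeholders; the right numbers depend on your data and on the available VRAM):

# Explicit batch sizes to add to the DeepSpeech.py command above;
# 16/8/8 are placeholder values, not a recommendation for your GPUs
    --train_batch_size 16 \
    --dev_batch_size 8 \
    --test_batch_size 8 \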

If you are training from Common Voice, I don’t think this is a good idea: the expected sample rate is 16 kHz.

Please follow the docs carefully: training requires CUDA 10.0.

Thanks for your quick feedback Lissyx!

No, it’s not. It’s a custom dataset meant to build a model that understands specific technical vocabulary. The input audio is sampled at 32 kHz, which is why I set the audio_sample_rate attribute to 32000.

Just to speed up training; it’s not the final configuration.

Correct, I had identified that point, but for this GPU performance test I assumed it didn’t matter. Now that you highlight it, maybe my assumption was wrong. Any advice on the batch size? (NB: the training set contains 7,500 clips, val & test 1,000.)

My bad, I totally missed that prerequisite. I’ll check how to downgrade the CUDA version to 10.0.

Out of curiosity, since training seems to work well with CUDA 11, what improvement should I expect from CUDA 10.0? Do you think my performance issue might come from that?

It highly depends on the average length of your data, as well as on the available memory.

On mixed datasets with clips of up to 10 seconds, I can push a batch size of 64 without automatic mixed precision on a 2x RTX 2080 Ti (11 GB VRAM) setup.
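
A practical way to find your own ceiling is to launch a short trial run with a candidate batch size and keep an eye on per-GPU memory; if you hit an out-of-memory error, halve the value and retry. For example, in a second terminal:

# Refresh per-GPU memory use and utilization every second during the trial run
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv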

The dependency is a hard one coming from TensorFlow; I would not even expect things to work at all.

Confirmed,

Using a batch size of 16 (to get a sensible number of train/val/test steps per epoch given the size of my dataset), training is now almost twice as fast on the 2-GPU environment compared to the single-GPU one: for 60 epochs, 2h40 on the 2-GPU machine vs 5h00 on the single-GPU one.

Note that I achieved this with CUDA 11, which I haven’t downgraded yet.
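
For reference, here is roughly the command behind those numbers (a sketch: I set the dev/test batch sizes to 16 as well, and kept the remaining flags as in my first post):

# Sketch of the 60-epoch comparison run; dev/test batch sizes assumed equal to the train one
python3 ./DeepSpeech.py \
    --train_cudnn True \
    --train_files $d/clips/train.csv \
    --dev_files $d/clips/dev.csv \
    --test_files $d/clips/test.csv \
    --train_batch_size 16 \
    --dev_batch_size 16 \
    --test_batch_size 16 \
    --audio_sample_rate 32000 \
    --epochs 60 \
    --n_hidden 1024 \
    --summary_dir $FOLDER/summaries/ \
    --checkpoint_dir $FOLDER/checkpoints/ \
    --export_dir $FOLDER/model/ \
    2>&1 | tee $FOLDER/training.log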

Thanks Lissyx