Optimization of DeepSpeech training in a multi-GPU environment

Dear Deepspeech team,

I need some advice on optimizing DeepSpeech training in a multi-GPU environment. I tried to find answers in the various posts and in your deepspeech-playbook, without success.

Here is the configuration of my server:

  • OS: Ubuntu 20.04.2 LTS
  • CPU: 2 x Intel® Xeon® Silver 4114 CPU @ 2.20GHz
  • GPU: 2 x NVIDIA Quadro P5000 16 GB
  • RAM: 64 GB
  • Storage: 512 GB SSD

About Drivers and TF:

  • Driver Version: 460.32.03
  • CUDA Version: 11.2
  • Tensorflow: v1.15.3-68-gdf8c55c 1.15.4
  • Python: 3.6.9

Here is the command I use to start the training on DeepSpeech 0.9.3:

python3 ./DeepSpeech.py \
    --train_cudnn True \
    --train_files $d/clips/train.csv \
    --dev_files $d/clips/dev.csv \
    --test_files $d/clips/test.csv \
    --audio_sample_rate 32000 \
    --epochs 1 \
    --summary_dir $FOLDER/summaries/ \
    --checkpoint_dir $FOLDER/checkpoints/ \
    --n_hidden 1024 \
    --export_dir $FOLDER/model/ \
    2>&1 | tee $FOLDER/training.log

The point is that:

  • When I train DeepSpeech in this multi-GPU environment, training lasts 7m02s
  • When I train DeepSpeech on a single GPU (setting the environment variable CUDA_VISIBLE_DEVICES to 0 or 1 before calling python), training lasts 8m48s

This means that training in a 2-GPU environment is only about 20% faster than on a single GPU, whereas I’ve seen in a post that it should theoretically be twice as fast. Moreover, according to nvidia-smi output:

  • during single-GPU training, the GPU is used 100% of the time
  • during 2-GPU training, GPU usage oscillates between 30% and 60%.

Note, in case it’s relevant: the same python process (PID) uses both GPUs.
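For completeness, this is roughly how I pin the single-GPU runs (a minimal sketch; the key point is that CUDA_VISIBLE_DEVICES must be set before TensorFlow is imported, otherwise the process has already enumerated both devices):

```python
import os

# Pin this process to GPU 0 only. This must happen before
# `import tensorflow`, otherwise both devices are already visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> 0
```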

Could you give me some advice on finding the bottleneck in my configuration? Or is there a way to improve multi-GPU training by tuning the training parameters?

Many thanks in advance,


You don’t mention your dataset. From the naming I assume this is Common Voice?

Why reduce the model size to 1024?

Also, there’s no batch size set, so you are defaulting to 1, which is very much not optimal.

If you are training from Common Voice, I don’t think this is a good idea; the sample rate is expected to be 16kHz.

Please follow the docs accurately: training requires CUDA 10.0.

Thanks for your quick feedback Lissyx!

No it’s not. It’s a custom dataset, built to train a model able to understand specific technical vocabulary. The input audio is sampled at 32 kHz, which is why I set the audio_sample_rate attribute to 32000.

Just to speed up training; it’s not the final configuration.

Correct, I had identified that point, but for this test of GPU performance I assumed it didn’t matter. Now that you highlight it, maybe my assumption was wrong. Any advice for the batch size? (NB: the training set contains 7500 clips, val & test 1000 each.)
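As a rough sanity check on what the batch size means for a dataset of this size (the clip count is from the split above; the larger batch sizes are just illustrative):

```python
import math

# Train split described above: 7500 clips.
train_clips = 7500

# With DeepSpeech's default batch size of 1, every clip is its own
# optimizer step; larger batches cut the steps per epoch accordingly.
for batch_size in (1, 16, 32):
    steps = math.ceil(train_clips / batch_size)
    print(f"batch_size={batch_size}: {steps} steps/epoch")
# batch_size=1:  7500 steps/epoch
# batch_size=16:  469 steps/epoch
# batch_size=32:  235 steps/epoch
```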

My bad, I totally missed that prerequisite. I’ll check how to downgrade the CUDA version to 10.0.

Out of curiosity, since training seems to work well with CUDA 11, what improvement should I expect from CUDA 10.0? Do you think my performance issue may come from that?

It depends heavily on the average length of your data, as well as the available memory.

On mixed datasets with data ranging up to 10 secs, I can push a batch size of 64 without automatic mixed precision on a 2x RTX2080Ti (11 GB VRAM) setup.

The dependency is a hard one from tensorflow; I would not even expect things to work at all.


Using a batch size of 16 (to get a sensible number of train/val/test steps per epoch given the size of my dataset), training is almost twice as fast in the 2-GPU environment compared to the single-GPU one: for 60 epochs, 2h40 on the 2-GPU machine vs 5h00 on the single-GPU one.
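For the record, the speedup works out like this (just the arithmetic on the timings above):

```python
# Wall-clock times for 60 epochs, from the runs described above.
single_gpu_minutes = 5 * 60       # 5h00
dual_gpu_minutes = 2 * 60 + 40    # 2h40

speedup = single_gpu_minutes / dual_gpu_minutes
print(f"{speedup:.3f}x")  # 1.875x -- close to the ideal 2x
```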

Note that I achieved this with CUDA 11, which I haven’t downgraded yet.

Thanks Lissyx