Dear DeepSpeech team,
I need some advice on optimizing DeepSpeech training in a multi-GPU environment. I tried to find answers in the various posts and in your deepspeech-playbook, without success.
Here is the configuration of my server:
- OS: Ubuntu 20.04.2 LTS
- CPU: 2 x Intel® Xeon® Silver 4114 CPU @ 2.20GHz
- GPU: 2 x NVIDIA Quadro P5000 16 GB
- RAM: 64 GB
- Storage: 512 GB SSD
About Drivers and TF:
- Driver Version: 460.32.03
- CUDA Version: 11.2
- TensorFlow: v1.15.3-68-gdf8c55c 1.15.4 (version string obtained as shown after this list)
- Python: 3.6.9
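For reference, the TensorFlow version string above is the output of the usual version check (the driver and CUDA versions are taken from nvidia-smi):
python3 -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"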
Here is the command I use to start the training on DeepSpeech 0.9.3:
python3 ./DeepSpeech.py \
--train_cudnn True \
--train_files $d/clips/train.csv \
--dev_files $d/clips/dev.csv \
--test_files $d/clips/test.csv \
--audio_sample_rate 32000 \
--epochs 1 \
--summary_dir $FOLDER/summaries/ \
--checkpoint_dir $FOLDER/checkpoints/ \
--n_hidden 1024 \
--export_dir $FOLDER/model/ \
2>&1 | tee $FOLDER/training.log
The point is that:
- When I train DeepSpeech on this multi-GPU environment, training takes 7m02s
- When I train DeepSpeech on a single GPU (by setting the CUDA_VISIBLE_DEVICES environment variable to 0 or 1 before calling python, as shown just below), training takes 8m48s
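The single-GPU run uses the same flags as the command above, only launched with the environment variable set, roughly like this:
# make only the first GPU visible to the process (use 1 for the second card)
export CUDA_VISIBLE_DEVICES=0
python3 ./DeepSpeech.py \
--train_cudnn True \
...   (remaining flags identical to the command above)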
This means that training on two GPUs only reduces the training time by about 20% compared to a single GPU, whereas I have read in a post that it should theoretically be close to twice as fast. Moreover, according to the nvidia-smi output:
- during single-GPU training, the GPU is used 100% of the time
- during 2-GPU training, GPU usage oscillates between 30% and 60%.
Note, in case it matters: the same Python process (PID) uses both GPUs.
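For completeness, the usage figures above come from watching plain nvidia-smi during training, along the lines of:
# refresh the standard nvidia-smi view every second
watch -n 1 nvidia-smi
# or stream per-GPU utilization samples
nvidia-smi dmon -s u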
Could you give me some advice on finding the bottleneck in my configuration? Or is there a way to improve multi-GPU training by tuning the training parameters?
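For example, I have not set any of the batch-size flags, so they are at their defaults. Would adding something like the following on top of the command above be the right direction? (The values are just a guess on my side and would need tuning for the 16 GB of each P5000.)
--train_batch_size 32 \
--dev_batch_size 32 \
--test_batch_size 32 \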
Many thanks in advance,
Fabien.