Unable to train on multiple GPU

iyer.sujatha94 · January 31, 2019, 1:35pm

I have 2 GPUs (GeForce 1080Ti) in one box. I tried to run DeepSpeech on roughly 20 minutes of data (1178 files of about 1 second each: 800 in the train set, the rest in the dev set). Running this on a single GPU works fine. But on multiple GPUs the loss is always NaN and I get the error "Nan in summary histogram". These are the hyperparameters I used for both the single-GPU and multi-GPU runs:

--n_hidden 494
--learning_rate 0.0001
--train_batch_size 4
--dev_batch_size 2

TensorFlow version 1.12.0

Running DeepSpeech.py on a single GPU works fine:

Training of Epoch 0 - loss: 4.271649
Training of Epoch 1 - loss: 0.046320
Training of Epoch 2 - loss: 0.010699

When I run the same thing on 2 GPUs, I get this:

Training of Epoch 0 - loss: nan
Training of Epoch 1 - loss: nan

At the end I get this error: Nan in summary histogram for: b6_0

I also tried using run-cluster.sh. After the pre-processing, it hangs at this point:

[worker 1] Instructions for updating:
[worker 1] To construct input pipelines, use the tf.data module.
[worker 1] W Parameter --validation_step needs to be >0 for early stopping to work

Please let me know if I am missing something.

reuben · January 31, 2019, 1:47pm

The algorithmic (effective) batch size is the batch size specified in the flags times the number of GPUs used for training, so the number of GPUs, the learning rate and the batch size interact in ways that can affect training. In your case you doubled the batch size, so it’s a bit unexpected that this causes divergence. Maybe an unfortunate initialization? Try changing the random seed. If that doesn’t make any difference, lower the learning rate.
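
To make the batch size arithmetic concrete, here is a minimal sketch using the numbers from this thread (batch size 4 per GPU, 800 training files). The helper names are hypothetical and are not part of DeepSpeech; the rounding up of the last partial batch is also an assumption, since the thread does not say how leftover samples are handled.

```python
import math


def effective_batch_size(per_gpu_batch_size: int, num_gpus: int) -> int:
    """Samples that contribute to one optimizer step across all GPUs."""
    return per_gpu_batch_size * num_gpus


def steps_per_epoch(num_samples: int, per_gpu_batch_size: int, num_gpus: int) -> int:
    """Optimizer steps per epoch, assuming a partial last batch still counts."""
    batch = effective_batch_size(per_gpu_batch_size, num_gpus)
    return math.ceil(num_samples / batch)


if __name__ == "__main__":
    # Values taken from the question: --train_batch_size 4, 800 training files.
    for gpus in (1, 2):
        print(
            f"{gpus} GPU(s): effective batch size = {effective_batch_size(4, gpus)}, "
            f"steps per epoch = {steps_per_epoch(800, 4, gpus)}"
        )
    # 1 GPU(s): effective batch size = 4, steps per epoch = 200
    # 2 GPU(s): effective batch size = 8, steps per epoch = 100
```

So with 2 GPUs each epoch takes half as many optimizer steps, each computed from twice as many samples, which is why a learning rate that works on one GPU may need adjusting when the GPU count changes.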
