Distributed Training on a single machine with two GPUs


(Rpratesh) #1

Hi,

I am trying to use the run-cluster.sh script provided in DeepSpeech's code for distributed training.
My workstation has two NVIDIA 1080 GPUs in PCIe slots.

When I run the following command

./run-cluster.sh 1:2:1 --train_files /docker_files/voxforge/voxforge-train.csv --dev_files /docker_files/voxforge/voxforge-dev.csv --test_files /docker_files/voxforge/voxforge-test.csv --checkpoint_dir /docker_files/checkpoints_cv_mozilla/ --epoch -3 --n_hidden 2048

where DeepSpeech's trained checkpoints (downloaded from GitHub) are in /docker_files/checkpoints_cv_mozilla/. I have two workers (one per GPU) and one parameter server.

But after I run the above command, the processes get stuck after a certain point. Here's the output:

[worker 0] Preprocessing done
[worker 0] ('Preprocessing', ['/docker_files/voxforge/voxforge-dev.csv'])
[worker 1] Preprocessing done
[worker 1] ('Preprocessing', ['/docker_files/voxforge/voxforge-dev.csv'])
[worker 0] Preprocessing done
[worker 0] ('Preprocessing', ['/docker_files/voxforge/voxforge-test.csv'])
[worker 1] Preprocessing done
[worker 1] ('Preprocessing', ['/docker_files/voxforge/voxforge-test.csv'])
[worker 0] Preprocessing done
[worker 0] WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py:335: __init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
[worker 0] Instructions for updating:
[worker 0] To construct input pipelines, use the tf.data module.
[worker 1] Preprocessing done
[worker 1] WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py:335: __init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
[worker 1] Instructions for updating:
[worker 1] To construct input pipelines, use the tf.data module.

The processes allocate all the memory on both GPUs, but volatile GPU utilization always stays at 0 to 2%, which suggests no training is being initiated.

I wanted to know if there's anything I'm missing in this process.


(Lissyx) #2

How long has this been running? Can you compare without run-cluster.sh? Since you have all GPUs on one machine, they should work without the distributed setup.


(Rpratesh) #3

I've waited for more than 12 hours; it's still stuck there.

Yes, I've tried without run-cluster.sh, and training runs fine, though it seems to take almost 18 hours per epoch.

I thought using the cluster script would give faster training, instead of leaving it to TensorFlow's automatic allocation.


(Tilman Kamp) #4

run-cluster.sh is more a blueprint, test, or example script for an actual distributed setup, not a production-ready way to train on more than one GPU. DeepSpeech.py is already multi-GPU capable on the local machine and is the way to go if you don't plan for multiple machines and gradient exchange over the network.


(Rpratesh) #5

@Tilman_Kamp @lissyx
Any suggestions on how to modify the run-cluster.sh script for two machines connected over a network, each with two GPUs in its PCIe slots?

Can you mention the steps to follow in the above scenario? There seems to be little help in the TensorFlow community on how to train over loosely coupled machines.


(Tilman Kamp) #6

You have to start DeepSpeech.py for at least one parameter-server instance and all workers on all involved boxes in parallel (e.g. via SSH) and pass the following parameters (in addition to your normal training ones):

  • --ps_hosts and --worker_hosts: each a comma-separated list of all your parameter servers and all your worker instances as <host>:<port> pairs, so all instances know each other.
  • --job_name=worker for all workers and --job_name=ps for all parameter servers, so each instance knows its role.
  • --task_index=<index> for the index of this individual instance within the --ps_hosts or --worker_hosts list, so each instance knows which particular instance it is in the formed cluster.
  • --coord_port=<some-port>, which tells “worker 0” (which always acts as the coordinator among all other instances) which port to open, and tells the others which port to connect to for worker 0's coordination service.
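Putting the list above together for a two-machine, two-GPU-each cluster, here is a minimal sketch of a helper that builds the extra arguments for each DeepSpeech.py instance. The host names, ports, and the build_cluster_flags helper itself are hypothetical; only the flag names come from the list above.

```python
# Hypothetical example cluster: one parameter server and one worker on
# machine1, a second worker on machine2. Host names and ports are made up.
PS_HOSTS = "machine1:2222"
WORKER_HOSTS = "machine1:2223,machine2:2223"
COORD_PORT = 2500  # port worker 0 opens for the coordination service

def build_cluster_flags(job_name, task_index):
    """Extra flags (beyond the usual training ones) for one instance."""
    return [
        "--ps_hosts=" + PS_HOSTS,
        "--worker_hosts=" + WORKER_HOSTS,
        "--job_name=" + job_name,         # "ps" or "worker"
        "--task_index=%d" % task_index,   # position in the matching host list
        "--coord_port=%d" % COORD_PORT,   # so every instance finds worker 0
    ]

# e.g. the flags for the second worker (running on machine2):
print(" ".join(build_cluster_flags("worker", 1)))
```

Each instance would then be launched as `python DeepSpeech.py <normal training flags> <cluster flags>` on its own box.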

With that said: if you now look at the script again, you'll see that it is a demonstration and local-only. Its primary purpose is to give you a first impression of how things would look in a real scenario.


(Hashim) #7

[quote=“Tilman_Kamp, post:4, topic:32995, full:true”] DeepSpeech.py is already multi GPU capable on the local machine
[/quote]

I have four GPUs in one machine, and when I train, it is very slow, from which I conclude that it uses only the first one.
What specific argument do I need to give when training with multiple GPUs, as it doesn't seem to use them by default?


(Reuben Morais) #8

Training being slow is not enough evidence to conclude it's only using one GPU; you should look at nvidia-smi during training. DeepSpeech.py uses all GPUs by default.
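For that check, nvidia-smi's query mode gives machine-readable per-GPU utilization. A small sketch of parsing it, run here on a canned sample string so it works without a GPU; on a real box you would feed it the output of `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`:

```python
def parse_utilization(smi_output):
    """Map GPU index -> utilization percent from nvidia-smi CSV output."""
    util = {}
    for line in smi_output.strip().splitlines():
        idx, pct = (field.strip() for field in line.split(","))
        util[int(idx)] = int(pct)
    return util

# Canned sample standing in for real nvidia-smi output on a busy 2-GPU box.
sample = "0, 97\n1, 95\n"
print(parse_utilization(sample))  # {0: 97, 1: 95}
```

If all GPUs are in use during training, every index should show a high percentage rather than sitting near zero.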


(Hashim) #9

Yeah, I agree. I probably missed a point in my report: I have tested the same code on another machine with one GPU, and the time is about the same as on the machine with four GPUs.

I'm away from my PC; I will check nvidia-smi and update…


(Jose Fernandez) #10

Yeah, I agree. Me too.


(Hashim) #11

@reuben Thanks for your prompt response. The error was that the tensorflow package was installed instead of tensorflow-gpu. That is resolved now, but a new issue has come up.
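Since the CPU-only tensorflow package and tensorflow-gpu share the same import name, this mix-up is easy to hit. A small hypothetical helper that spots it from a list of installed distribution names (e.g. from `pip list`); the helper name and its input are assumptions for illustration:

```python
def pick_tf_package(installed):
    """Return the TensorFlow variant found, preferring the GPU build."""
    names = {name.lower() for name in installed}
    for candidate in ("tensorflow-gpu", "tensorflow"):
        if candidate in names:
            return candidate
    return None

print(pick_tf_package(["numpy", "tensorflow"]))      # tensorflow
print(pick_tf_package(["tensorflow-gpu", "scipy"]))  # tensorflow-gpu
```

If this reports plain tensorflow on a GPU box, training will silently run on the CPU.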

Training runs smoothly on a single GPU, but with two GPUs the training seems to hang on the following lines for hours.

Preprocessing ['data/CV/cv-valid-train.csv']
Preprocessing done
Preprocessing ['data/CV/cv-valid-dev.csv']
Preprocessing done
W Parameter --validation_step needs to be >0 for early stopping to work

During training with two GPUs, the output of nvidia-smi is
nvidia-smi Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

Normally, before training, nvidia-smi gives

nvidia-smi
Sat Jan 12 08:38:15 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
| 40%   37C    P0    32W / 120W |    184MiB /  6078MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:05:00.0 Off |                  N/A |
| 40%   31C    P8     6W / 120W |      2MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1370      G   /usr/lib/xorg/Xorg                            77MiB |
|    0      1449      G                                                 10MiB |
|    0      1597      G   /usr/bin/gnome-shell                          94MiB |
+-----------------------------------------------------------------------------+


(Lissyx) #12

FYI, two GPUs here and no problem