I am trying to use the run-cluster.sh script provided in DeepSpeech's code for distributed training.
My workstation has two NVIDIA 1080 GPUs in PCIe slots.
where DeepSpeech's trained checkpoints (downloaded from GitHub) are in /docker_files/checkpoints_cv_mozilla/. I have two workers, one per GPU, and one parameter server.
But after I run the above command, the processes get stuck after a certain point. Here's the output -
[worker 0] Preprocessing done
[worker 0] ('Preprocessing', ['/docker_files/voxforge/voxforge-dev.csv'])
[worker 1] Preprocessing done
[worker 1] ('Preprocessing', ['/docker_files/voxforge/voxforge-dev.csv'])
[worker 0] Preprocessing done
[worker 0] ('Preprocessing', ['/docker_files/voxforge/voxforge-test.csv'])
[worker 1] Preprocessing done
[worker 1] ('Preprocessing', ['/docker_files/voxforge/voxforge-test.csv'])
[worker 0] Preprocessing done
[worker 0] WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py:335: __init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
[worker 0] Instructions for updating:
[worker 0] To construct input pipelines, use the tf.data module.
[worker 1] Preprocessing done
[worker 1] WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py:335: __init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
[worker 1] Instructions for updating:
[worker 1] To construct input pipelines, use the tf.data module.
The processes use up all the memory on both GPUs, but the volatile GPU utilization reported by nvidia-smi always stays at 0 to 2%, so no training seems to have been initiated.
I wanted to know if there's anything I'm missing in this process.
lissyx
How long has this been running? Can you compare without run-cluster.sh? Since you have all GPUs on one machine, they should work without the distributed setup.
run-cluster.sh is more of a blueprint, test, or example script for an actual distributed setup, not a production-ready way to train on more than one GPU. DeepSpeech.py is already multi GPU capable on the local machine and is the way to go if you don't plan for multiple machines and gradient exchange over the network.
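For a plain local multi-GPU run, the invocation looks roughly like the sketch below; the CSV paths, checkpoint directory and batch size are placeholders, so substitute your own:

# expose both local GPUs to the process; DeepSpeech.py will use every GPU it can see
export CUDA_VISIBLE_DEVICES=0,1
python -u DeepSpeech.py \
    --train_files /data/train.csv \
    --dev_files /data/dev.csv \
    --test_files /data/test.csv \
    --checkpoint_dir /data/checkpoints \
    --train_batch_size 12

No extra flag is needed to enable multi-GPU training; the work is spread over the visible GPUs automatically.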
@Tilman_Kamp @lissyx
Any suggestions on how to modify the run-cluster.sh script for two machines connected over a network, each with two GPUs in its PCIe slots?
Could you mention the steps to be followed in the above scenario? There seems to be little help in the TensorFlow community on how to train over loosely coupled machines.
You have to start DeepSpeech.py for at least one parameter-server instance and all workers on all involved boxes in parallel (e.g. via SSH) and pass the following parameters (in addition to your normal training ones):
--ps_hosts and --worker_hosts: each a comma-separated list of all your parameter-server and all your worker instances as <host>:<port> pairs. So all instances know about each other.
--job_name=worker for all workers and --job_name=ps for all parameter servers. So each instance knows its role.
--task_index=<index> for the index of this individual instance within the --ps_hosts or --worker_hosts list. So each instance knows which particular instance it is in the formed cluster.
--coord_port=<some-port>, which tells "worker 0" (which always acts as the coordinator among all other instances) which port to open, and tells the others which port they have to connect to in order to reach worker 0's coordination service.
With that said: if you now look at the script again, you'll see that it is a demonstration and local-only. Its primary purpose is to give you a first impression of how things would/should look in a real scenario.
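To make that concrete, the launches for two boxes with two GPUs each could look roughly like the sketch below, with one parameter server on box1 and one worker per box; the hostnames, ports and the trailing training flags are placeholders:

# on box1: the parameter server and worker 0 (the coordinator)
python -u DeepSpeech.py --ps_hosts=box1:2220 --worker_hosts=box1:2221,box2:2221 \
    --job_name=ps --task_index=0 <your usual training flags> &
python -u DeepSpeech.py --ps_hosts=box1:2220 --worker_hosts=box1:2221,box2:2221 \
    --job_name=worker --task_index=0 --coord_port=2500 <your usual training flags> &
# on box2 (started e.g. over SSH): worker 1
python -u DeepSpeech.py --ps_hosts=box1:2220 --worker_hosts=box1:2221,box2:2221 \
    --job_name=worker --task_index=1 --coord_port=2500 <your usual training flags> &

Each worker should then pick up the GPUs of its own box, so all four GPUs end up doing work.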
[quote=“Tilman_Kamp, post:4, topic:32995, full:true”] DeepSpeech.py is already multi GPU capable on the local machine
[/quote]
I have four GPUs in one machine, and when I train, the training is very slow, from which I conclude that it uses only the first one.
What specific argument do I need to give when training with multiple GPUs, as it doesn't seem to use them by default?
Training being slow is not enough evidence to conclude it's only using one GPU; you should look at nvidia-smi during training. DeepSpeech.py uses all GPUs by default.
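For example, while a training run is going you can watch utilization refresh every second with either of these:

watch -n 1 nvidia-smi
# or nvidia-smi's own loop mode
nvidia-smi -l 1

Every GPU that is actually taking part should show your python process in the process list and a non-zero GPU-Util.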
Yeah, I agree. I probably missed a point in my report: I have tested the same code on another machine with one GPU, and the training time is roughly the same as on the machine with four GPUs.
I am away from my PC; I will check nvidia-smi and update…
@reuben Thanks for your prompt response. The error was that plain tensorflow was installed instead of tensorflow-gpu; that part is resolved now.
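In case someone else hits the same thing, a quick way to check which package is installed and swap it out is something along these lines (the exact TensorFlow version to install is whatever DeepSpeech's requirements pin, so treat the last line as a placeholder):

pip list | grep tensorflow
# if plain tensorflow shows up instead of tensorflow-gpu:
pip uninstall tensorflow
pip install tensorflow-gpu==<version pinned by DeepSpeech's requirements>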
However, a new issue has come up: training runs smoothly with a single GPU, but with two GPUs the training seems to hang on the following lines for hours.
Preprocessing ['data/CV/cv-valid-train.csv']
Preprocessing done
Preprocessing ['data/CV/cv-valid-dev.csv']
Preprocessing done
W Parameter --validation_step needs to be >0 for early stopping to work
During training with two GPUs, nvidia-smi gives:
Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU
Normally, before training, nvidia-smi gives
nvidia-smi
Sat Jan 12 08:38:15 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87 Driver Version: 390.87 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:03:00.0 Off | N/A |
| 40% 37C P0 32W / 120W | 184MiB / 6078MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 106... Off | 00000000:05:00.0 Off | N/A |
| 40% 31C P8 6W / 120W | 2MiB / 6078MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1370 G /usr/lib/xorg/Xorg 77MiB |
| 0 1449 G 10MiB |
| 0 1597 G /usr/bin/gnome-shell 94MiB |
+-----------------------------------------------------------------------------+