Multi-Machine Distributed Training issue

I am trying to do distributed training using 3 worker machines (1 GPU each) and one additional machine acting as the parameter server (PS).

I use the following command on the workers, changing the task index (0, 1, 2) -
./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name worker

And this command for the PS - ./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name ps
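For reference, my understanding is that these flags just describe a standard TF 1.x cluster spec, and each machine then starts a server for its own job_name/task_index slot. A rough sketch with the plain TensorFlow API (not DeepSpeech's actual code; the variable names are mine):

import tensorflow as tf

# The same hosts I pass via --ps_hosts and --worker_hosts.
ps_hosts = ["10.10.1.2:2223"]
worker_hosts = ["10.10.1.3:2223", "10.10.1.4:2223", "10.10.1.1:2223"]

# Every machine builds the same cluster spec...
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# ...but starts a server only for its own slot, e.g. the PS:
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # the PS just serves variables and blocks here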

The 3 workers show Preprocessing ['/users/varunimagenet/cv-valid-train.csv'] as output, and the PS shows nothing.

When I looked at GPU/CPU usage, I found that preprocessing is not using the GPU at all, even though I had uninstalled tensorflow and installed tensorflow-gpu.

Is this how it is supposed to be? Can I do the preprocessing on the GPU to speed things up?

Preprocessing can take a long time for big datasets, yes, and it is all done on the CPU. So what you’re seeing is expected behavior. You can save the preprocessed features to disk to speed things up considerably in future runs. Check out the --train_cached_features_path flag.
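For example, something along these lines on the training machines (the cache path below is just an illustration; keep the rest of your flags as they are):

./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --train_cached_features_path /users/varunimagenet/cache/train_features.cache ...

Roughly, the first run still does the preprocessing on the CPU and writes the features out to that file, and later runs load them from disk instead of recomputing.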

Thanks for the help! I am stuck after this step.

On the PS I used -
./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name ps --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

My PS shows no output.

Worker 1 is proceeding as expected -
12% (24966 of 193284) || Elapsed Time: 5:45:28
12% (24967 of 193284) || Elapsed Time: 5:45:29
12% (24968 of 193284) || Elapsed Time: 5:45:30
12% (24969 of 193284) || Elapsed Time: 5:45:31
12% (24970 of 193284) || Elapsed Time: 5:45:32

Worker 2 shows the following error trace -
"
Preprocessing done
WARNING:tensorflow:From ./DeepSpeech.py:421: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
  File "/usr/lib/python3.6/http/client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.6/socket.py", line 724, in create_connection
    raise err
  File "/usr/lib/python3.6/socket.py", line 713, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./DeepSpeech.py", line 553, in train
    job = coord.get_job()
  File "/users/DeepSpeech/util/coordinator.py", line 530, in get_job
    result = self._talk_to_chief(PREFIX_GET_JOB + str(FLAGS.task_index))
  File "/users/DeepSpeech/util/coordinator.py", line 445, in _talk_to_chief
    res = urllib.request.urlopen(urllib.request.Request(url, data, { 'content-type': 'text/plain' }))
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 1346, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.6/urllib/request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
"

And worker 3 is stuck on -
"
Preprocessing done
WARNING:tensorflow:From ./DeepSpeech.py:421: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
WARNING:tensorflow:From /root/tmp/deepspeech-venv/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /root/tmp/deepspeech-venv/lib/python3.6/site-packages/tensorflow/python/training/sync_replicas_optimizer.py:352: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
"

I tried various things, but to no avail.

The commands I ran on the 3 workers (changing only the task index) are as follows -
Worker 1 - ./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name worker --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

Worker 2 - ./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 1 --job_name worker --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

Worker 3 - ./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 2 --job_name worker --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

I am stuck on this distributed training and would appreciate any help!

If I try to remove worker 3 and instead add it as the coordination host/port pair using -

./DeepSpeech.py --train_files /users/kshiteej/varunimagenet/cv-valid-train.csv --dev_files /users/kshiteej/varunimagenet/cv-valid-dev.csv --test_files /users/kshiteej/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223 --validation_step 1 --test_batch_size 16 --coord_host 10.10.1.1 --coord_port 2223

I get the following error -

Traceback (most recent call last):
  File "./DeepSpeech.py", line 934, in <module>
    tf.app.run(main)
  File "/root/tmp/deepspeech-venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./DeepSpeech.py", line 894, in main
    server = tf.train.Server(Config.cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
  File "/root/tmp/deepspeech-venv/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 148, in __init__
    self._server = c_api.TF_NewServer(self._server_def.SerializeToString())
tensorflow.python.framework.errors_impl.InternalError: Job “localhost” was not defined in cluster
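To check my understanding of that error: the cluster spec only defines "ps" and "worker" jobs, so I am guessing that without --job_name the default (apparently "localhost") is used, which is not part of the cluster. A minimal sketch with the plain TF 1.x API (not DeepSpeech's code) that I believe hits the same error:

import tensorflow as tf

# Same cluster as in my command: only "ps" and "worker" jobs exist.
cluster = tf.train.ClusterSpec({
    "ps": ["10.10.1.2:2223"],
    "worker": ["10.10.1.3:2223", "10.10.1.4:2223"],
})

# Asking for a job name that the ClusterSpec does not define fails with
# InternalError: Job "localhost" was not defined in cluster
server = tf.train.Server(cluster, job_name="localhost", task_index=0)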

Am I missing something?
