Distributed training failing

bazingab · April 4, 2019, 4:59pm

I am trying to do distributed training using 3 worker machines with 1 GPU each and one machine acting as the parameter server (PS).

I use the following command on the workers, changing the task index (0, 1, 2) for each -
./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name worker

And this command for the PS -
./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name ps

I used -
./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name ps --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

My PS shows no output.

Worker 1 is proceeding as expected -
12% (24966 of 193284) || Elapsed Time: 5:45:28
12% (24967 of 193284) || Elapsed Time: 5:45:29
12% (24968 of 193284) || Elapsed Time: 5:45:30
12% (24969 of 193284) || Elapsed Time: 5:45:31
12% (24970 of 193284) || Elapsed Time: 5:45:32

Worker 2 shows the following error trace -
"
Preprocessing done
WARNING:tensorflow:From ./DeepSpeech.py:421: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
…
File "/usr/lib/python3.6/http/client.py", line 936, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.6/socket.py", line 724, in create_connection
raise err
File "/usr/lib/python3.6/socket.py", line 713, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./DeepSpeech.py", line 553, in train
job = coord.get_job()
File "/users/DeepSpeech/util/coordinator.py", line 530, in get_job
result = self._talk_to_chief(PREFIX_GET_JOB + str(FLAGS.task_index))
File "/users/DeepSpeech/util/coordinator.py", line 445, in _talk_to_chief
res = urllib.request.urlopen(urllib.request.Request(url, data, { 'content-type': 'text/plain' }))
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 1346, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.6/urllib/request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
"

And worker 3 is stuck on -
"
Preprocessing done
WARNING:tensorflow:From ./DeepSpeech.py:421: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The SyncReplicaOptimizer class is deprecated. For synchrononous training, please use Distribution Strategies.
WARNING:tensorflow:From /root/tmp/deepspeech-venv/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /root/tmp/deepspeech-venv/lib/python3.6/site-packages/tensorflow/python/training/sync_replicas_optimizer.py:352: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
"

I tried various steps, but to no avail.

The commands I ran on the 3 workers, differing only in the task index, are as follows -
WORKER 1 - ./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name worker --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

WORKER 2 - ./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 1 --job_name worker --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

WORKER 3 - ./DeepSpeech.py --train_files /users/varunimagenet/cv-valid-train.csv --dev_files /users/varunimagenet/cv-valid-dev.csv --test_files /users/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 2 --job_name worker --checkpoint_dir /users/varunimagenet/deepspeech --validation_step 100

I am stuck on this distributed training and would appreciate any help!

If I try to remove worker 3, and add it as a coordination host/port pair using -

./DeepSpeech.py --train_files /users/kshiteej/varunimagenet/cv-valid-train.csv --dev_files /users/kshiteej/varunimagenet/cv-valid-dev.csv --test_files /users/kshiteej/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223 --validation_step 1 --test_batch_size 16 --coord_host 10.10.1.1 --coord_port 2223

I get the following error -
Traceback (most recent call last):
File "./DeepSpeech.py", line 934, in <module>
tf.app.run(main)
File "/root/tmp/deepspeech-venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./DeepSpeech.py", line 894, in main
server = tf.t

Tilman_Kamp · April 5, 2019, 8:19am

Just set --coord_host to your first worker:

workers (with changing task indices):

DeepSpeech.py
--train_files … --dev_files … --test_files …
--ps_hosts 10.10.1.2:2223
--worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223
--coord_host 10.10.1.3
--task_index 0
--job_name worker

parameter server:

DeepSpeech.py
--train_files … --dev_files … --test_files …
--ps_hosts 10.10.1.2:2223
--worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223
--coord_host 10.10.1.3
--task_index 0
--job_name ps
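Since every process has to be started with identical --ps_hosts, --worker_hosts and --coord_host values, it can help to generate the command lines from a single place. The following is only a rough sketch: build_command is a hypothetical helper, not part of DeepSpeech, it uses only flags that appear in this thread, and it assumes the first worker is the coordination host.

```python
# Cluster layout copied from the commands in this thread.
PS_HOSTS = ["10.10.1.2:2223"]
WORKER_HOSTS = ["10.10.1.3:2223", "10.10.1.4:2223", "10.10.1.1:2223"]


def build_command(job_name, task_index, extra_args=()):
    """Return the DeepSpeech.py argument list for one cluster member.

    Hypothetical helper: only job_name and task_index vary between
    processes; the shared cluster flags are always identical.
    """
    # Assumption: the coordination host is the first worker (task index 0).
    coord_host = WORKER_HOSTS[0].split(":")[0]
    return [
        "./DeepSpeech.py",
        *extra_args,  # e.g. the --train_files / --dev_files / --test_files flags
        "--ps_hosts", ",".join(PS_HOSTS),
        "--worker_hosts", ",".join(WORKER_HOSTS),
        "--coord_host", coord_host,
        "--task_index", str(task_index),
        "--job_name", job_name,
    ]


if __name__ == "__main__":
    # Print one command line per worker, then the parameter server's.
    for index in range(len(WORKER_HOSTS)):
        print(" ".join(build_command("worker", index)))
    print(" ".join(build_command("ps", 0)))
```

Each printed line would then be run on the machine matching its task index, so a mismatch in the shared flags between machines is ruled out.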

I hope this solves your problem.

Sorry about this news, but a couple of days ago I decided to remove distributed training support from DeepSpeech altogether. The PR has already been merged. Neither we nor the community really used it, and it kept us from implementing far more important productivity features.

bazingab · April 5, 2019, 6:45pm

Thanks for the help, Tilman. I have checked out the older code, so I think I am good to go for a distributed setup.

I tried what you suggested -
./DeepSpeech.py --train_files /users/kshiteej/varunimagenet/cv-valid-train.csv --dev_files /users/kshiteej/varunimagenet/cv-valid-dev.csv --test_files /users/kshiteej/varunimagenet/cv-valid-test.csv --ps_hosts 10.10.1.2:2223 --worker_hosts 10.10.1.3:2223,10.10.1.4:2223,10.10.1.1:2223 --task_index 0 --job_name worker --validation_step 1 --test_batch_size 16 --coord_host 10.10.1.3

But still I get errors on the 2 non-coordination workers -
Exception in thread Thread-49:
Traceback (most recent call last):
File "/usr/lib/python3.6/urllib/request.py", line 1318, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/usr/lib/python3.6/http/client.py", line 936, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.6/socket.py", line 724, in create_connection
raise err
File "/usr/lib/python3.6/socket.py", line 713, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/users/kshiteej/DeepSpeech/util/feeding.py", line 133, in _populate_batch_queue
index = self._data_set.next_index(index) % file_count
File "./DeepSpeech.py", line 391, in <lambda>
next_index=lambda i: coord.get_next_index('train'))
File "/users/kshiteej/DeepSpeech/util/coordinator.py", line 478, in get_next_index
value = int(self._talk_to_chief(PREFIX_NEXT_INDEX + set_name))
File "/users/kshiteej/DeepSpeech/util/coordinator.py", line 445, in _talk_to_chief
res = urllib.request.urlopen(urllib.request.Request(url, data, { 'content-type': 'text/plain' }))
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 1346, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.6/urllib/request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
Exception in thread Thread-51:
Traceback (most recent call last):
File "/usr/lib/python3.6/urllib/request.py", line 1318, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/usr/lib/python3.6/http/client.py", line 936, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.6/socket.py", line 724, in create_connection
raise err
File "/usr/lib/python3.6/socket.py", line 713, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

… and so on for many threads

Tilman_Kamp · April 12, 2019, 9:45am

To figure out what’s going on in the background, you can add --log_level 0 and --log_traffic to all your DeepSpeech.py calls.
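Because the tracebacks above end in ConnectionRefusedError (Errno 111), it may also be worth checking from each non-coordinating worker whether the coordination host accepts TCP connections at all. A minimal sketch using only the Python standard library follows; can_connect is a hypothetical helper, not part of DeepSpeech, and the port to test is whatever --coord_port the processes are actually started with.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (Errno 111), timeouts and
        # unreachable or unresolvable hosts.
        return False


# Example call (values are assumptions taken from this thread):
#   can_connect("10.10.1.3", 2223)
```

A False result while the first worker is already running would suggest that nothing is listening on that host and port yet, or that a firewall is in the way, rather than a problem in the training code itself.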
