I have finished distributed training across multiple PCs with multiple workers, and each worker has its own specific checkpoint folder. What I'm not sure about is how to export the model for inference in the cluster case.
Should I check each worker's checkpoint folder, pick the one with the latest checkpoint file, and export the model from that? Any suggestions?
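A minimal sketch of that approach, assuming each worker wrote its checkpoints to its own directory (the directory names below are made up) and that the winning checkpoint is then fed into your usual export path:

```python
import tensorflow as tf

# Hypothetical per-worker checkpoint directories from the cluster run.
worker_dirs = ["checkpoints/worker_0", "checkpoints/worker_1"]

def newest_checkpoint(dirs):
    """Return the checkpoint path with the highest global step across dirs."""
    best_path, best_step = None, -1
    for d in dirs:
        path = tf.train.latest_checkpoint(d)   # reads the 'checkpoint' index file
        if path is None:
            continue
        step = int(path.rsplit("-", 1)[-1])    # paths look like .../model.ckpt-12345
        if step > best_step:
            best_path, best_step = path, step
    return best_path

ckpt = newest_checkpoint(worker_dirs)
print("Would export from:", ckpt)
# Rebuild the inference graph, restore this checkpoint with tf.train.Saver,
# and then run the usual export/freeze step on it.
```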
Hi @chesterkuo, how did you add the server IPs for distributed training?
I tried this but ran into errors; maybe I'm not doing it the right way. Can you point out what's wrong in this script?
The error points to the --ps_host parameter. If that isn't the way to assign the parameter server, how should I do it? My error is:
Traceback (most recent call last):
File "DeepSpeech.py", line 1838, in <module>
tf.app.run()
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 1795, in main
train()
File "DeepSpeech.py", line 1501, in train
results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
File "DeepSpeech.py", line 633, in get_tower_results
device = tf.train.replica_device_setter(worker_device=available_devices[i], cluster=cluster)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/device_setter.py", line 197, in replica_device_setter
cluster_spec = cluster.as_dict()
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 334, in as_dict
if max(task_indices) + 1 == len(task_indices):
ValueError: max() arg is an empty sequence
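That ValueError is raised by ClusterSpec.as_dict() when one of the jobs in the cluster spec has no task addresses at all, which usually means the parameter-server host list ended up empty (for example, because the flag name was not the one the script expects). A minimal sketch of how the cluster spec and the device setter fit together, using placeholder addresses:

```python
import tensorflow as tf

# Placeholder addresses; in practice these come from the ps/worker host flags
# passed to each process.
ps_hosts = ["192.168.1.10:2222"]
worker_hosts = ["192.168.1.11:2223", "192.168.1.12:2223"]

# If ps_hosts were an empty list, ClusterSpec.as_dict() would fail with
# "max() arg is an empty sequence", exactly as in the traceback above.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# This is the call that fails inside get_tower_results() when the spec is bad.
device_fn = tf.train.replica_device_setter(
    worker_device="/job:worker/task:0", cluster=cluster)

with tf.device(device_fn):
    # Variables are placed on the ps job, ops on the worker device.
    w = tf.Variable(tf.zeros([10]), name="w")
```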
Hi @Tilman_Kamp, the rest of the script is the same; I only changed the coordinator's port.
I was getting this error trace.
On the worker machine:
Instructions for updating:
Use the retry module or similar alternatives.
E0523 11:47:02.841798575 86717 server_chttp2.cc:38] {"created":"@1527076022.841733775","description":"No address added out of total 1 resolved","file":"external/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":302,"referenced_errors":[{"created":"@1527076022.841731175","description":"Failed to add any wildcard listeners","file":"external/grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":337,"referenced_errors":[{"created":"@1527076022.841714076","description":"Unable to configure socket","fd":53,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1527076022.841709676","description":"OS Error","errno":98,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1527076022.841730775","description":"Unable to configure socket","fd":53,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1527076022.841727075","description":"OS Error","errno":98,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]}]}]}
Traceback (most recent call last):
File "DeepSpeech.py", line 1838, in <module>
tf.app.run()
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 1799, in main
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
self._server_def.SerializeToString(), status)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server
On the host machine (PS), it doesn't show anything other than this:
Instructions for updating:
Use the retry module or similar alternatives.
Oh, never mind @Tilman_Kamp. A process was still running in the background and holding that port. I killed it, and now I think it has started to train.
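A quick way to confirm that kind of conflict before starting the server is a plain bind test (the port number below is a placeholder):

```python
import socket

def port_is_free(port, host=""):
    """Try to bind the port; False means something is already holding it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError:
        # errno 98 (EADDRINUSE) is the "Address already in use" from the gRPC log.
        return False
    finally:
        s.close()

print(port_is_free(2222))  # placeholder worker/coordinator port
```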
Yea, I changed the coordinator port and I think it works fine. Thanks @Tilman_Kamp.