Distributed training


(chesterkuo) #1

Hi there

I have finished distributed training across multiple PCs with multiple workers, and each worker has its own checkpoint folder. What I'm not sure about is how to export the model for inference in the cluster case.

Should I check each worker's checkpoint folder, see which one has the latest checkpoint file, and export the model from that? Any suggestions?
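
Something like this is what I have in mind (a rough sketch, not DeepSpeech's own export path; the worker directories below are placeholders):

    import tensorflow as tf

    # Placeholder paths for the per-worker checkpoint folders.
    worker_dirs = ["/ckpt/worker0", "/ckpt/worker1"]

    def newest_checkpoint(dirs):
        """Pick the checkpoint with the highest global step across all dirs."""
        best_path, best_step = None, -1
        for d in dirs:
            path = tf.train.latest_checkpoint(d)  # e.g. '.../model.ckpt-1234'
            if path is None:
                continue
            step = int(path.rsplit("-", 1)[-1])  # global step from the name
            if step > best_step:
                best_path, best_step = path, step
        return best_path

    print(newest_checkpoint(worker_dirs))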


(Jageshmaharjan) #2

Hi @chesterkuo, how did you add the server IP for distributed training?
I tried the script below but I encounter errors; maybe I am not doing it the right way. Can you point out what's wrong in this script?

 python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/ \
  --checkpoint_dir /data/zh_data/checkpoint/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts "104.211.xx.xx:2222" \

The error is about the --ps_hosts parameter. If that's not the way to assign the parameter server, how should I do it? And my error is:

Traceback (most recent call last):
  File "DeepSpeech.py", line 1838, in <module>
    tf.app.run()
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 1795, in main
    train()
  File "DeepSpeech.py", line 1501, in train
    results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
  File "DeepSpeech.py", line 633, in get_tower_results
    device = tf.train.replica_device_setter(worker_device=available_devices[i], cluster=cluster)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/device_setter.py", line 197, in replica_device_setter
    cluster_spec = cluster.as_dict()
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 334, in as_dict
    if max(task_indices) + 1 == len(task_indices):
ValueError: max() arg is an empty sequence

(Tilman Kamp) #3

@jageshmaharjan Have you forgotten to assign the --worker_hosts parameter?
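
Without it, the worker job of the ClusterSpec is empty, which is exactly what trips the max() call in your traceback. A minimal sketch of the failure mode (assuming the flags end up in a ClusterSpec roughly like this):

    import tensorflow as tf

    # With --ps_hosts set but no --worker_hosts, the worker job has no tasks.
    cluster = tf.train.ClusterSpec({
        "ps": ["104.211.xx.xx:2222"],
        "worker": [],
    })

    # In the TF version from the traceback, as_dict() then calls max()
    # on an empty sequence of task indices:
    cluster.as_dict()  # ValueError: max() arg is an empty sequence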


(Jageshmaharjan) #4

I have it in my script; sorry, I think I didn't paste it here.
Let me edit again, @Tilman_Kamp.

This is on the host server (ps), and this is my script:

python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/distributedTf/ \
  --checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts localhost:2233 \
  --worker_hosts localhost:2222 \
  --task_index 0 \
  --job_name ps

And this is the script on my worker machine:

python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/distributedTf/ \
  --checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts localhost:2233 \
  --worker_hosts localhost:2222 \
  --task_index 0 \
  --job_name worker \
  --coord_host localhost \
  --coord_port 2222


(Tilman Kamp) #5

Ah OK. Far more complete now. Two things:

  • All instances have to get the same --coord_host and --coord_port parameters (one coordination service they all talk to); see the sketch below.
  • The coordination service's host:port combination should not be used by any worker or ps.
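
Put differently, every instance should share one cluster layout plus one separate coordinator address, along these lines (the ports here are illustrative, not prescriptive):

    import tensorflow as tf

    # The same cluster definition on every instance:
    cluster = tf.train.ClusterSpec({
        "ps": ["localhost:2233"],
        "worker": ["localhost:2222"],
    })

    # The same coordinator address on every instance, on a third port
    # (e.g. 2501) that no ps or worker task binds:
    coord_host, coord_port = "localhost", 2501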

(Jageshmaharjan) #6

Hi @Tilman_Kamp, the rest of the script is the same; I just changed the port of the coordinator.
And I was getting this error trace.

On the worker machine:

Instructions for updating:
Use the retry module or similar alternatives.
E0523 11:47:02.841798575 86717 server_chttp2.cc:38] {"created":"@1527076022.841733775","description":"No address added out of total 1 resolved","file":"external/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":302,"referenced_errors":[{"created":"@1527076022.841731175","description":"Failed to add any wildcard listeners","file":"external/grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":337,"referenced_errors":[{"created":"@1527076022.841714076","description":"Unable to configure socket","fd":53,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1527076022.841709676","description":"OS Error","errno":98,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1527076022.841730775","description":"Unable to configure socket","fd":53,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1527076022.841727075","description":"OS Error","errno":98,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]}]}]}
Traceback (most recent call last):
  File "DeepSpeech.py", line 1838, in <module>
    tf.app.run()
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 1799, in main
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server

and on the host machine (ps), it doesn't show anything other than this:

Instructions for updating:
Use the retry module or similar alternatives.


(Jageshmaharjan) #7

Oh, never mind @Tilman_Kamp, there was a process in the background consuming that port. I killed the process, and now I think it has started to train.
Yeah, I changed the coordinator port and it works fine now. Thanks @Tilman_Kamp. :slight_smile:
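
For anyone hitting the same "Address already in use" error, a quick check before launching would have saved me some time (a small helper of my own, nothing DeepSpeech-specific):

    import socket

    # Returns True if something is already listening on host:port.
    def port_in_use(host, port):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            return s.connect_ex((host, port)) == 0

    print(port_in_use("localhost", 2222))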


(Jageshmaharjan) #8

Hi @Tilman_Kamp, I got another error after some time during training.

2018-05-24 03:05:38.015799: F tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
Aborted (core dumped)

Thinking it was GPU memory still held by a previous process, I re-ran the worker script.
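
Before re-running, a quick way to spot stale processes still holding GPU memory is something like this (my own snippet, assuming nvidia-smi is on the PATH; the query flags are standard nvidia-smi options):

    import subprocess

    # List processes currently holding GPU memory, with their PIDs.
    print(subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"]
    ).decode())

Anyway, this is the new trace: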

E You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E   [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E   [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E
E Caused by op 'Placeholder_5', defined at:
E   File "DeepSpeech.py", line 1838, in <module>
E     tf.app.run()
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
E     _sys.exit(main(argv))
E   File "DeepSpeech.py", line 1820, in main
E     train(server)
E   File "DeepSpeech.py", line 1489, in train
E     tower_feeder_count=len(available_devices))
E   File "/data/jugs/asr/DeepSpeech/util/feeding.py", line 42, in __init__
E     self.ph_batch_size = tf.placeholder(tf.int32, [])
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1777, in placeholder
E     return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4521, in placeholder
E     "Placeholder", dtype=dtype, shape=shape, name=name)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E     op_def=op_def)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
E     op_def=op_def)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
E     self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E
E InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E   [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E   [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E
E The checkpoint in /data/zh_data/checkpoint/distributedCkp/ does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /data/zh_data/checkpoint/distributedCkp/.
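
To see what the checkpoint actually holds before clearing it out, the stored variable shapes can be dumped with the stock TF checkpoint reader (a sketch; the path is the one from the error message):

    import tensorflow as tf

    # Compare these shapes against the current --n_hidden / alphabet size.
    ckpt = tf.train.latest_checkpoint("/data/zh_data/checkpoint/distributedCkp/")
    reader = tf.train.NewCheckpointReader(ckpt)
    for name, shape in sorted(reader.get_variable_to_shape_map().items()):
        print(name, shape)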