Distributed training

Hi there

I have finished distributed training across multiple PCs with multiple workers, and each worker has its own checkpoint folder. The question I'm not sure about is: how do I export the model for inference in the cluster case?

Should I check each worker's checkpoint folder, see which one has the latest checkpoint file, and export the model from that? Any suggestions?
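Or would something like the following be the right direction? This is only a rough sketch from my side: the flag names are the usual training ones, and whether a run like this (pointing --checkpoint_dir at the chief worker's, i.e. task_index 0, checkpoint folder and setting --export_dir) really just loads the latest checkpoint and exports it is exactly what I'm unsure about.

    # hypothetical paths; --n_hidden and the alphabet must match what was used in training
    python -u DeepSpeech.py \
      --checkpoint_dir /path/to/worker0/checkpoint/ \
      --export_dir /path/to/exportDir/ \
      --n_hidden 375 \
      --alphabet_config_path /path/to/alphabet.txt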

Hi @chesterkuo, how did you add the server IP for distributed training?
I did this but I encounter errors; maybe I am not doing it the right way. Can you point out what's wrong in this script?

 python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/ \
  --checkpoint_dir /data/zh_data/checkpoint/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts "104.211.xx.xx:2222" \

The error is on the --ps_hosts parameter. If that's not the way to assign a parameter server, how should I do it? My error is:

Traceback (most recent call last):
  File "DeepSpeech.py", line 1838, in <module>
    tf.app.run()
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 1795, in main
    train()
  File "DeepSpeech.py", line 1501, in train
    results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
  File "DeepSpeech.py", line 633, in get_tower_results
    device = tf.train.replica_device_setter(worker_device=available_devices[i], cluster=cluster)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/device_setter.py", line 197, in replica_device_setter
    cluster_spec = cluster.as_dict()
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 334, in as_dict
    if max(task_indices) + 1 == len(task_indices):
ValueError: max() arg is an empty sequence

@jageshmaharjan Have you forgotten to assign the --worker_hosts parameter?

I have it in my script. Sorry, I think I didn't paste it here.
Let me edit it again. @Tilman_Kamp

This is the host server (ps), and this is my script:

python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/distributedTf/ \
  --checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts localhost:2233 \
  --worker_hosts localhost:2222 \
  --task_index 0 \
  --job_name ps

And, this is my worker machine, and this is the script:

python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/distributedTf/ \
  --checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts localhost:2233 \
  --worker_hosts localhost:2222 \
  --task_index 0 \
  --job_name worker \
  --coord_host localhost \
  --coord_port 2222

Ah OK. Far more complete now. Two things:

  • All instances have to get the same --coord_host and --coord_port parameters (one coordination service they all talk to).
  • The coordination service's host and port combination should not be used by any worker or ps (see the sketch below).
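For example, a minimal sketch of a consistent layout (the host addresses and the coordination port 2500 are made up here; only the flag names come from your scripts):

    # ps instance (keep the data/model flags from your script; they are omitted here)
    python -u DeepSpeech.py \
      --ps_hosts 10.0.0.1:2233 \
      --worker_hosts 10.0.0.2:2222 \
      --coord_host 10.0.0.1 \
      --coord_port 2500 \
      --task_index 0 \
      --job_name ps

    # worker instance: identical cluster and coordination flags; only --job_name
    # (and, with more workers, --task_index) differs. Note that port 2500 is not
    # used by any ps or worker endpoint.
    python -u DeepSpeech.py \
      --ps_hosts 10.0.0.1:2233 \
      --worker_hosts 10.0.0.2:2222 \
      --coord_host 10.0.0.1 \
      --coord_port 2500 \
      --task_index 0 \
      --job_name worker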

Hi @Tilman_Kamp, the rest of the script is the same; I just changed the port of the coordinator.
And I was getting this error trace.

On the worker machine:

Instructions for updating:
Use the retry module or similar alternatives.
E0523 11:47:02.841798575 86717 server_chttp2.cc:38] {"created":"@1527076022.841733775","description":"No address added out of total 1 resolved","file":"external/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":302,"referenced_errors":[{"created":"@1527076022.841731175","description":"Failed to add any wildcard listeners","file":"external/grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":337,"referenced_errors":[{"created":"@1527076022.841714076","description":"Unable to configure socket","fd":53,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1527076022.841709676","description":"OS Error","errno":98,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1527076022.841730775","description":"Unable to configure socket","fd":53,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1527076022.841727075","description":"OS Error","errno":98,"file":"external/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]}]}]}
Traceback (most recent call last):
  File "DeepSpeech.py", line 1838, in <module>
    tf.app.run()
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 1799, in main
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server

and on the host machine (ps), it doesn't show anything other than this:

Instructions for updating:
Use the retry module or similar alternatives.

Oh, never mind @Tilman_Kamp, there was a process in the background consuming that port. I killed the process, and now I think it has started to train.
Yeah, I changed the coordinator port and I think it works fine. Thanks @Tilman_Kamp. :slight_smile:
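In case anyone else hits the same "Address already in use" error: finding the process that holds the port is ordinary Linux tooling, nothing DeepSpeech-specific, for example:

    # show which process is bound to the port (here 2222)
    lsof -i :2222            # or: ss -ltnp | grep :2222
    # then stop it by its PID
    kill <pid>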

Hi @Tilman_Kamp, I got another error after some time during training.

2018-05-24 03:05:38.015799: F tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
Aborted (core dumped)

Thinking it was GPU memory still held by a previous process, I re-ran the worker script and got another trace.

E You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E   [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E   [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E
E Caused by op 'Placeholder_5', defined at:
E   File "DeepSpeech.py", line 1838, in <module>
E     tf.app.run()
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
E     _sys.exit(main(argv))
E   File "DeepSpeech.py", line 1820, in main
E     train(server)
E   File "DeepSpeech.py", line 1489, in train
E     tower_feeder_count=len(available_devices))
E   File "/data/jugs/asr/DeepSpeech/util/feeding.py", line 42, in __init__
E     self.ph_batch_size = tf.placeholder(tf.int32, [])
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1777, in placeholder
E     return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4521, in placeholder
E     "Placeholder", dtype=dtype, shape=shape, name=name)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E     op_def=op_def)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
E     op_def=op_def)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
E     self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E
E InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E   [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E   [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E
E The checkpoint in /data/zh_data/checkpoint/distributedCkp/ does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /data/zh_data/checkpoint/distributedCkp/.

Hi @jageshmaharjan, can you please share both of the scripts that are working for you?

Hi @gr8nishan. There could be two reasons for that error (a quick check for both is sketched below):

  1. Not enough available GPU memory (either it is being used by another process, or the GPUs are too small).
  2. The checkpoint directory already contains checkpoints saved with different hyperparameters.

Also, make sure Git Large File Storage (git-lfs) is installed.
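For example, a rough way to check both points (standard tools; the checkpoint path is the one from my scripts above):

    # 1) see whether another process is still holding GPU memory
    nvidia-smi
    # 2) if hyperparameters (alphabet, --n_hidden, ...) changed, move the old checkpoints aside
    mv /data/zh_data/checkpoint/distributedCkp /data/zh_data/checkpoint/distributedCkp.bak
    mkdir -p /data/zh_data/checkpoint/distributedCkp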

Hi @jageshmaharjan, I am trying to do distributed training across CPUs. I believe that should be possible.

Yeah, of course that's possible too. However, it will take a long time, depending on your data.
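One note from my side (an assumption, not something tested in this thread): if the machines do have GPUs but you want the run to stay on CPU, a common way is to hide the GPUs from TensorFlow when launching each instance, e.g.

    # hide all GPUs so TensorFlow falls back to CPU; keep the rest of the flags as usual
    CUDA_VISIBLE_DEVICES="" python -u DeepSpeech.py \
      --job_name worker \
      --task_index 0
    # (plus the same cluster, data and model flags as in the worker script above)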