@tilmankamp, @lissyx
- **Have I written custom code (as opposed to running examples on an unmodified clone of the repository)**: No
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Ubuntu 16.04
- **TensorFlow installed from (our builds, or upstream TensorFlow)**:
- **TensorFlow version (use command below)**: 1.6
- **Python version**: 3.6
- **Bazel version (if compiling from source)**:
- **GCC/Compiler version (if compiling from source)**:
- **CUDA/cuDNN version**: CUDA 9.0 / cuDNN 7.0
- **GPU model and memory**: Tesla M60 * 4
- **Exact command to reproduce**: I am trying to run DeepSpeech with distributed TensorFlow; the exact commands are below.

This is my script on the parameter server (ps):
```
python -u DeepSpeech.py \
--train_files /data/zh_data/data_thchs30/train.csv \
--dev_files /data/zh_data/data_thchs30/dev.csv \
--test_files /data/zh_data/data_thchs30/test.csv \
--train_batch_size 80 \
--dev_batch_size 80 \
--test_batch_size 40 \
--n_hidden 375 \
--epoch 200 \
--validation_step 1 \
--early_stop True \
--earlystop_nsteps 6 \
--estop_mean_thresh 0.1 \
--estop_std_thresh 0.1 \
--dropout_rate 0.22 \
--learning_rate 0.00095 \
--report_count 100 \
--use_seq_length False \
--export_dir /data/zh_data/exportDir/distributedTf/ \
--checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
--decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /data/zh_data/alphabet.txt \
--lm_binary_path /data/zh_data/zh_lm.binary \
--lm_trie_path /data/zh_data/trie \
--ps_hosts localhost:2233 \
--worker_hosts localhost:2222 \
--task_index 0 \
--job_name ps
```
And this is the script for the worker machine:
```
python -u DeepSpeech.py \
--train_files /data/zh_data/data_thchs30/train.csv \
--dev_files /data/zh_data/data_thchs30/dev.csv \
--test_files /data/zh_data/data_thchs30/test.csv \
--train_batch_size 80 \
--dev_batch_size 80 \
--test_batch_size 40 \
--n_hidden 375 \
--epoch 200 \
--validation_step 1 \
--early_stop True \
--earlystop_nsteps 6 \
--estop_mean_thresh 0.1 \
--estop_std_thresh 0.1 \
--dropout_rate 0.22 \
--learning_rate 0.00095 \
--report_count 100 \
--use_seq_length False \
--export_dir /data/zh_data/exportDir/distributedTf/ \
--checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
--decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /data/zh_data/alphabet.txt \
--lm_binary_path /data/zh_data/zh_lm.binary \
--lm_trie_path /data/zh_data/trie \
--ps_hosts localhost:2233 \
--worker_hosts localhost:2222 \
--task_index 0 \
--job_name worker \
--coord_host localhost \
--coord_port 2501
```
The parameter server (ps) machine runs completely fine, but on the worker machine I first encounter this error:
```
2018-05-24 03:05:38.015799: F tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
Aborted (core dumped)
```
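Before re-running, this is roughly how one can check whether a leftover process is still holding the GPUs (a minimal sketch, assuming `nvidia-smi` is installed and on the PATH):
```
# Minimal sketch: list compute processes still holding GPU memory.
# Assumes nvidia-smi is available; prints pid, process name, and memory per GPU process.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv"]
)
print(out.decode())
```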
I thought the previous process was still consuming GPU memory, so I re-ran the worker script. Then I got another error:
```
E OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E
E Caused by op 'tower_1/Minimum', defined at:
E File "DeepSpeech.py", line 1838, in <module>
E tf.app.run()
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
E _sys.exit(main(argv))
E File "DeepSpeech.py", line 1820, in main
E train(server)
E File "DeepSpeech.py", line 1501, in train
E results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E File "DeepSpeech.py", line 640, in get_tower_results
E calculate_mean_edit_distance_and_loss(model_feeder, i, no_dropout if optimizer is None else dropout_rates)
E File "DeepSpeech.py", line 521, in calculate_mean_edit_distance_and_loss
E logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
E File "DeepSpeech.py", line 417, in BiRNN
E layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(batch_x, h1), b1)), FLAGS.relu_clip)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4565, in minimum
E "Minimum", x=x, y=y, name=name)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E op_def=op_def)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
E op_def=op_def)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
E self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
E
E ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E
Traceback (most recent call last):
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
...
[Note: the log is too long, so I have truncated it here.]
...
E InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E
E The checkpoint in /data/zh_data/checkpoint/distributedCkp/ does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /data/zh_data/checkpoint/distributedCkp/.
```
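For reference, the hint in the log about `report_tensor_allocations_upon_oom` refers to the `RunOptions` proto passed to `session.run`. A minimal sketch of how that flag could be set (here `sess`, `fetches`, and `feed_dict` are hypothetical stand-ins; I have not modified DeepSpeech.py itself):
```
# Minimal sketch: ask TensorFlow to report live tensor allocations on OOM.
# sess, fetches, and feed_dict are placeholders for the real session objects.
import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
sess.run(fetches, feed_dict=feed_dict, options=run_options)
```
And regarding the final message about the checkpoint not matching the model shapes, one way to follow its suggestion and move the old checkpoint aside (a sketch using the paths from my setup):
```
# Minimal sketch: archive the stale checkpoint dir instead of deleting it,
# then recreate an empty one, as the error message suggests.
import os
import shutil

ckpt_dir = "/data/zh_data/checkpoint/distributedCkp"
shutil.move(ckpt_dir, ckpt_dir + ".bak")  # archive rather than delete
os.makedirs(ckpt_dir)                     # fresh, empty checkpoint dir
```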