We changed the server to a p3.2xlarge instance, but it still gives the error message:
(pve)$ more bin/biology1.sh
export CUDA_VISIBLE_DEVICES=0
export TF_FORCE_GPU_ALLOW_GROWTH=true
python -u DeepSpeech.py --noshow_progressbar \
--checkpoint_dir data/checkpoint \
--export_dir data/models \
--train_files data/trial1/biology2/biology2-train.csv \
--test_files data/trial1/biology2/biology2-test.csv \
--dev_files data/trial1/biology2/biology2-dev.csv \
--n_hidden 2048 \
--train_cudnn true \
--dev_batch_size 4 \
--train_batch_size 4 \
--test_batch_size 4 \
--epochs 10 \
--learning_rate 0.0001 \
--dropout_rate 0.15 \
--scorer data/lm/lm.scorer \
(pve):$./bin/biology2.sh
swig/python detected a memory leak of type 'Alphabet *', no destructor found.
I Loading best validating checkpoint from data/checkpoint/best_dev-748522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
I Training epoch 0...
Traceback (most recent call last):
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
[[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 940, in main
train()
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 608, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 568, in run_set
feed_dict=feed_dict)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3':
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 940, in main
train()
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 487, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 313, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 191, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
return converted_call(f, options, args, kwargs)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
return f(*args, **kwargs)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
training)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
seed=self._seed)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
time_major=time_major, name=name)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
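
To make the "model config" part of the error easier to read, here is a small sketch that decodes it into named fields. The helper `decode_model_config` is hypothetical (it is not part of DeepSpeech or TensorFlow); the enum names are the ones cuDNN uses for `cudnnRNNMode_t`, `cudnnRNNInputMode_t` and `cudnnDirectionMode_t`. It only reformats the message and does not diagnose the failure.

```python
import re

# Enum values as defined by cuDNN; the message prints them as bare integers.
RNN_MODE = {0: "RNN_RELU", 1: "RNN_TANH", 2: "LSTM", 3: "GRU"}
INPUT_MODE = {0: "LINEAR_INPUT", 1: "SKIP_INPUT"}
DIRECTION = {0: "UNIDIRECTIONAL", 1: "BIDIRECTIONAL"}

# Field names exactly as they appear in the second bracketed group.
SHAPE_KEYS = ["num_layers", "input_size", "num_units", "dir_count",
              "max_seq_length", "batch_size", "cell_num_units"]


def decode_model_config(message):
    """Hypothetical helper: turn the 'model config' text into a dict."""
    # First group: three integers after "...rnn_direction_mode]:".
    modes = re.search(r"rnn_direction_mode\]:\s*(\d+),\s*(\d+),\s*(\d+)", message)
    # Second group: a bracketed list after "...cell_num_units]:".
    shape = re.search(r"cell_num_units\]:\s*\[([^\]]+)\]", message)
    if modes is None or shape is None:
        raise ValueError("not a ThenRnnForward model config message")

    mode, input_mode, direction = (int(g) for g in modes.groups())
    config = {
        "rnn_mode": RNN_MODE[mode],
        "rnn_input_mode": INPUT_MODE[input_mode],
        "rnn_direction_mode": DIRECTION[direction],
    }
    values = [int(v) for v in shape.group(1).split(",")]
    config.update(zip(SHAPE_KEYS, values))
    return config


# The message text copied from the traceback above.
message = ("Failed to call ThenRnnForward with model config: "
           "[rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , "
           "[num_layers, input_size, num_units, dir_count, max_seq_length, "
           "batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]")
config = decode_model_config(message)
for key, value in config.items():
    print(key, "=", value)
```

Read this way, the failing op is a single-layer unidirectional LSTM with 2048 units and a batch of 4, which matches `--n_hidden 2048` and `--train_batch_size 4` in the script; `max_seq_length` of 158 is presumably the number of time steps in the longest sample of that batch.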