Failed to implement Stacked LSTM

I want to test the effectiveness of a stacked LSTM. I tried two experiments, but both failed.

1. Change the CudnnLSTM num_layers:
fw_cell = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=2,

The error

Traceback (most recent call last):
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [2, 2048, 2048, 1, 873, 32, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [2, 2048, 2048, 1, 873, 32, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “./DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 939, in run_script
absl.app.run(main)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 911, in main
train()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 588, in train
train_loss, _ = run_set(‘train’, epoch, train_init_op)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 548, in run_set
feed_dict=feed_dict)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [2, 2048, 2048, 1, 873, 32, 2048]
[[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [2, 2048, 2048, 1, 873, 32, 2048]
[[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3’:
File “./DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 939, in run_script
absl.app.run(main)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 911, in main
train()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 474, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 321, in get_tower_results
gradients = optimizer.compute_gradients(avg_loss)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/training/optimizer.py”, line 512, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_impl.py”, line 158, in gradients
unconnected_gradients)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_util.py”, line 679, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_util.py”, line 350, in _MaybeCompile
return grad_fn() # Exit early
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_util.py”, line 679, in
lambda: grad_fn(op, *out_grads))
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/cudnn_rnn_grad.py”, line 104, in _cudnn_rnn_backwardv3
direction=op.get_attr(“direction”)) + (None,)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py”, line 749, in cudnn_rnn_backprop_v3
name=name)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

…which was originally created as op ‘tower_0/cudnn_lstm/CudnnRNNV3’, defined at:
File “./DeepSpeech.py”, line 12, in
ds_train.run_script()
[elided 4 identical lines from previous traceback]
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 474, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 312, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 239, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 190, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 128, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/layers/base.py”, line 548, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py”, line 854, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 234, in wrapper
return converted_call(f, options, args, kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 330, in _call_unconverted
return f(*args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 440, in call
training)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 519, in _forward
seed=self._seed)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py”, line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py”, line 2051, in cudnn_rnnv3
time_major=time_major, name=name)

2. Change the geometry in create_model():
# Run through parametrized RNN implementation, as we use different RNNs
# for training and inference
output_1, output_state_1 = rnn_impl(layer_3, seq_length, previous_state, reuse)
# Reshape output from a tensor of shape [n_steps, batch_size, n_cell_dim]
# to a tensor of shape [n_steps*batch_size, n_cell_dim]
layers['rnn_output'] = output_1
layers['rnn_output_state'] = output_state_1
output, output_state = rnn_impl(output_1, seq_length, previous_state, reuse)
output = tf.reshape(output, [-1, Config.n_cell_dim])
layers['rnn_output'] = output
layers['rnn_output_state'] = output_state

The error

Traceback (most recent call last):
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 969, 32, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
[[tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_79]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 969, 32, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “./DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 943, in run_script
absl.app.run(main)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 915, in main
train()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 592, in train
train_loss, _ = run_set(‘train’, epoch, train_init_op)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 552, in run_set
feed_dict=feed_dict)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 969, 32, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_79]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 969, 32, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘tower_0/cudnn_lstm/CudnnRNNV3_1’:
File “./DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 943, in run_script
absl.app.run(main)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 915, in main
train()
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 478, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 316, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 243, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 198, in create_model
output, output_state = rnn_impl(output_1, seq_length, previous_state, reuse)
File “/home/training/0.7/DeepSpeech/training/deepspeech_training/train_stacked.py”, line 128, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/layers/base.py”, line 548, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py”, line 854, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 234, in wrapper
return converted_call(f, options, args, kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 330, in _call_unconverted
return f(*args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 440, in call
training)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 519, in _forward
seed=self._seed)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py”, line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py”, line 2051, in cudnn_rnnv3
time_major=time_major, name=name)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/home/training/tmp/deepspeech-0.7-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

Both variants can actually train, but the error appears when stepping into the validation phase.
Any advice?

These look like CUDA errors, for example the GPU running out of memory. Note that the second snippet you sent reuses the same weights for both operations, which is probably not what you want.
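
To illustrate the weight-reuse point: if an RNN helper caches one layer object and returns it on every call, calling the helper twice passes the data through the same weights twice instead of stacking two independent layers. The toy sketch below is plain Python, not TensorFlow. `ToyLayer`, `shared_rnn`, and `named_rnn` are hypothetical names, and whether the real `rnn_impl` caches its cell this way is an assumption.

```python
import itertools

_ids = itertools.count()


class ToyLayer:
    """Stand-in for an RNN layer; weights_id identifies one set of weights."""

    def __init__(self):
        self.weights_id = next(_ids)

    def __call__(self, trace):
        # Record which set of weights processed the input.
        return trace + [self.weights_id]


def shared_rnn(trace):
    # Caches a single layer on the function object, so every call reuses it.
    if not hasattr(shared_rnn, "cell"):
        shared_rnn.cell = ToyLayer()
    return shared_rnn.cell(trace)


_cells = {}


def named_rnn(trace, name):
    # One cached layer per name: distinct names get distinct weights.
    if name not in _cells:
        _cells[name] = ToyLayer()
    return _cells[name](trace)


reused = shared_rnn(shared_rnn([]))
stacked = named_rnn(named_rnn([], "lstm_1"), "lstm_2")
print(reused)   # both entries carry the same weights id
print(stacked)  # two different weights ids
```

In the first call chain the same weights appear twice; in the second, each layer name owns its own weights, which is what a stacked LSTM needs.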

Since training itself runs, I don’t think CUDA is the problem.

This is a test with batch size 16 (same error):

I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 196.900360
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 2 | Loss: 175.930588
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 3 | Loss: 142.219279
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 4 | Loss: 133.463558
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 5 | Loss: 131.575320
Epoch 0 | Training | Elapsed Time: 0:00:04 | Steps: 6 | Loss: 126.396845
Epoch 0 | Training | Elapsed Time: 0:00:05 | Steps: 7 | Loss: 119.461465
Epoch 0 | Training | Elapsed Time: 0:00:05 | Steps: 8 | Loss: 116.458263
Epoch 0 | Training | Elapsed Time: 0:00:06 | Steps: 9 | Loss: 113.713181

Edit: Fortunately, batch_size 8 seems to work.

GPU OOM errors usually happen at the end of the first epoch when the longest sentences in the training set are processed. You should probably double check that it’s not the last training batch rather than the validation step.
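
A rough way to see why the last training batch can be the heaviest: if samples are fed in order of increasing length (an assumption here, as are the example lengths), each batch is padded to its longest sequence, so the padded size `max_len * batch_size` peaks at the end of the epoch. The helper `padded_batch_sizes` is a hypothetical name used only for this sketch.

```python
def padded_batch_sizes(lengths, batch_size):
    """Return (max_len, n_items, padded_steps) per batch for length-sorted samples."""
    ordered = sorted(lengths)
    result = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        max_len = max(batch)
        # Every item in the batch is padded to the longest sequence.
        result.append((max_len, len(batch), max_len * len(batch)))
    return result


# Made-up sequence lengths (in time steps), for illustration only.
lengths = [120, 873, 300, 450, 969, 200, 610, 150]
sizes = padded_batch_sizes(lengths, batch_size=4)
for max_len, n_items, steps in sizes:
    print(f"max_len={max_len} items={n_items} padded_steps={steps}")
```

Halving the batch size halves the padded size of every batch, which would be consistent with a smaller batch size getting past the failing step.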

Hi, with your help I fixed the issue with the first method (changing num_layers).
However, the batch size had to shrink to 1/4 of the original size, so something must be wrong.
In addition, it’s odd that output_graph.pb is still the same size as the pretrained model (184487 KB). Is that because of create_inference_graph()? Thanks!
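
As a back-of-the-envelope check on the file-size question: under the standard LSTM formulation (four gates, one bias vector per gate, which is an assumption; cuDNN stores additional bias terms), a layer with input size 2048 and 2048 units, the sizes reported in the error message, holds about 33.6 million weights. In float32 that is roughly 128 MiB per layer, so a second LSTM layer that made it into the exported graph would be expected to change the file size noticeably. The function name `lstm_param_count` is hypothetical.

```python
def lstm_param_count(input_size, num_units):
    """Parameters of one standard LSTM layer: 4 gates, each with
    input weights, recurrent weights and one bias vector."""
    return 4 * (input_size * num_units + num_units * num_units + num_units)


per_layer = lstm_param_count(2048, 2048)
bytes_per_layer = per_layer * 4  # float32 takes 4 bytes per weight
mib_per_layer = bytes_per_layer / 2**20

print(per_layer)      # 33562624
print(mib_per_layer)  # 128.03125
```

The comment in the second snippet says that different RNN implementations are used for training and inference, so one thing worth checking is whether the change was applied to the implementation that the export actually uses.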

Those errors could be a manifestation of an upstream TensorFlow issue that we finally identified and fixed. The TensorFlow r1.15 fix was released a few days ago, so installing 1.15.4 should get rid of them.
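
A minimal sketch for checking whether a version string is at least 1.15.4. It only handles plain dotted numeric versions (no release-candidate suffixes); the helper `at_least` is a hypothetical name, and in a real environment the string would come from `tf.__version__`.

```python
def at_least(version, minimum):
    """Compare dotted numeric version strings, e.g. '1.15.2' vs '1.15.4'."""
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)


older = at_least("1.15.2", "1.15.4")
fixed = at_least("1.15.4", "1.15.4")
print(older)  # False
print(fixed)  # True
```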