Problem fine-tuning the 0.8.1 checkpoint for a specific domain (Biology)

We are trying to train on top of the DeepSpeech 0.8.1 model with around 500 hours of audio, divided into 5 batches. Each batch consists of 100 hours (WAV format, 8-bit samples, 16 kHz), or 2,000 files.
System config is AWS p2.16xlarge (64 vCPUs, 201 ECUs, 732 GiB RAM, 16 GPUs),
and here is the training command:

export TF_FORCE_GPU_ALLOW_GROWTH=true
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
python3 -u DeepSpeech.py --noshow_progressbar \
  --checkpoint_dir data/trial1/checkpoint \
  --train_files data/trial1/batch1-train.csv \
  --test_files data/trial1/batch1-test.csv \
  --dev_files data/trial1/batch1-test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --dev_batch_size 1 \
  --n_hidden 2048 \
  --train_cudnn true \
  --epochs 10 \
  --learning_rate 0.00001

To test transcription accuracy, we used 200 test files of about 5 minutes each.

Before training, we calculated the WER for these 200 test files with the stock DeepSpeech 0.8.1 model and got:
average WER = 30.69
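
For reference, here is a rough sketch of how we compute that per-file average (the jiwer package and the CSV column name are illustrative assumptions, not necessarily the exact script we used):

import csv
from jiwer import wer  # pip install jiwer

def average_wer(reference_csv, hypotheses):
    # reference_csv has a 'transcript' column (as in the DeepSpeech train/test CSVs);
    # hypotheses is a list of decoded strings in the same row order.
    with open(reference_csv) as f:
        refs = [row['transcript'] for row in csv.DictReader(f)]
    scores = [wer(ref, hyp) for ref, hyp in zip(refs, hypotheses)]
    return 100.0 * sum(scores) / len(scores)  # mean per-file WER, in percent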

After training on each successive batch, we got these results:
Batch1 WER = 33.43
Batch2 WER = 33.63
Batch3 WER = 53.00
Batch4 WER = 68.39
Batch5 WER = 63.67

What are we doing wrong, and why is the WER increasing?

I also observed that if I restart training with the same data, say Batch1 from the 0.8.1 checkpoint, I get different WER results every time. Is that normal?

I think you should increase the epochs.

If we increase the epochs above 10, the validation loss increases and the WER gets even worse.

Can I suggest you move to 0.9?

Why split the data into batches? It's quite possible the model has a harder time learning from five smaller sets.

So a mean of about 3 minutes per file. Our model was produced on clips of up to 10 seconds; it's quite possible this is hurting learning considerably.

16 GPUs for 100 hours, that might be a bit overkill.

A low batch size will impair learning.

That's what I can see off the top of my head.

Maybe @reuben can also weigh in here, but I fear spreading the data over that many GPUs might impair learning as well.

(FTR, we are on holiday until next Monday, so don't expect too much.)

Thank you, lissyx. Yes, we can move to 0.9.1, but before we move to that version I wanted to make sure I am not making any other mistakes. We can also try a smaller GPU setup, even though it takes longer to complete the training. Please let me know what the ideal batch size would be.

As high as you can without getting OOM errors. Usually 32, 64, 96, …; with your clip lengths maybe 8 or 16?

And try some dropout, maybe 0.15 or 0.3? And don't split the data into batches.
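
A simple way to find that limit: start a short trial run with a candidate --train_batch_size and watch GPU memory from a second shell; if you hit OOM or memory sits at the card's maximum, step back down. For example:

watch -n 1 nvidia-smi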

Thank you, let me try these options and I will come back with the results.

And as @lissyx said, most people train on shorter (max. 10 sec) chunks. Do you plan on feeding chunks of your length in the future or is it hard to cut them into smaller pieces?
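
If you do want to try splitting, here is a rough sketch using pydub (my assumption, not something from the DeepSpeech repo); note that a real split should cut on silences (e.g. pydub.silence.split_on_silence) so you can still align transcripts to the chunks:

from pydub import AudioSegment  # pip install pydub

def split_wav(path, out_prefix, chunk_ms=10000):
    # Naive fixed-length split into chunks of at most 10 seconds.
    # In practice you would split on silence and re-segment the transcripts.
    audio = AudioSegment.from_wav(path)
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        audio[start:start + chunk_ms].export(f"{out_prefix}_{i:04d}.wav", format="wav")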

Our training files vary in length from 20 seconds to less than 2 minutes, similar to LibriSpeech. When we try a batch size of 8, 16 or 32, we get the error below. Is there a specific reason for that, or is it because of the number of GPUs?
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 16, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3_3}}]]

Try a batch size of 4. I only have experience with smaller chunks, and if that transfers, 8 is about your maximum. But I don't know whether something in the RNN doesn't scale linearly with memory. I guess you can only have one batch per GPU, so a very uneven batch could break things. Generally I would go with a smaller server and one big GPU, like a p3.2xlarge which has 16 GB.

Please search the GitHub issues; this can be a symptom of a known TensorFlow / cuDNN bug that should be fixed in 1.15.4.

If not, since you have not shared the full log, we can’t decide whether there is still a bug in TensorFlow or if you are just hitting OOM on your GPU.
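
If you want to rule that out first, upgrading TensorFlow inside your training virtualenv should be enough (assuming you installed the GPU package):

pip install --upgrade 'tensorflow-gpu==1.15.4'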

We changed the server to a p3.2xlarge. It is still giving the error message:
(pve)$ more bin/biology1.sh
export CUDA_VISIBLE_DEVICES=0
export TF_FORCE_GPU_ALLOW_GROWTH=true
python -u DeepSpeech.py --noshow_progressbar \
  --checkpoint_dir data/checkpoint \
  --export_dir data/models \
  --train_files data/trial1/biology2/biology2-train.csv \
  --test_files data/trial1/biology2/biology2-test.csv \
  --dev_files data/trial1/biology2/biology2-dev.csv \
  --n_hidden 2048 \
  --train_cudnn true \
  --dev_batch_size 4 \
  --train_batch_size 4 \
  --test_batch_size 4 \
  --epochs 10 \
  --learning_rate 0.0001 \
  --dropout_rate 0.15 \
  --scorer data/lm/lm.scorer

(pve):$./bin/biology2.sh
swig/python detected a memory leak of type ‘Alphabet *’, no destructor found.
I Loading best validating checkpoint from data/checkpoint/best_dev-748522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
I Training epoch 0…
Traceback (most recent call last):
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
[[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 968, in run_script
absl.app.run(main)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 940, in main
train()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 608, in train
train_loss, _ = run_set(‘train’, epoch, train_init_op)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 568, in run_set
feed_dict=feed_dict)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 158, 4, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘tower_0/cudnn_lstm/CudnnRNNV3’:
File “DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 968, in run_script
absl.app.run(main)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 940, in main
train()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 487, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 313, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 240, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 191, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 129, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py”, line 548, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py”, line 854, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 234, in wrapper
return converted_call(f, options, args, kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 330, in _call_unconverted
return f(*args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 440, in call
training)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 518, in _forward
seed=self._seed)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py”, line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py”, line 2051, in cudnn_rnnv3
time_major=time_major, name=name)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

After upgrading to TensorFlow 1.15.4

(pve) :~/DeepSpeech$ ./bin/biology2.sh
swig/python detected a memory leak of type ‘Alphabet *’, no destructor found.
I Loading best validating checkpoint from data/checkpoint/best_dev-748522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
I Training epoch 0…
I Finished training epoch 0 - loss: 33.487564
I Validating epoch 0 on data/audiosinc/dst/biology2/biology2-test.csv…
Traceback (most recent call last):
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 164, 4, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
[[tower_0/Where/_171]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 164, 4, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 968, in run_script
absl.app.run(main)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 940, in main
train()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 618, in train
set_loss, steps = run_set(‘dev’, epoch, init_op, dataset=source)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 568, in run_set
feed_dict=feed_dict)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 164, 4, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/Where/_171]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 164, 4, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘tower_0/cudnn_lstm/CudnnRNNV3’:
File “DeepSpeech.py”, line 12, in
ds_train.run_script()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 968, in run_script
absl.app.run(main)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 940, in main
train()
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 487, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 313, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 240, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 191, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File “/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py”, line 129, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py”, line 548, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py”, line 854, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 234, in wrapper
return converted_call(f, options, args, kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 330, in _call_unconverted
return f(*args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 440, in call
training)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 518, in _forward
seed=self._seed)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py”, line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py”, line 2051, in cudnn_rnnv3
time_major=time_major, name=name)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/home/ubuntu/pve/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

After adding export TF_CUDNN_RESET_RND_GEN_STATE=1, it started working. Thank you for the hint.


OK, then could you please ping the TensorFlow people on CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED · Issue #41630 · tensorflow/tensorflow · GitHub and tell them?

I posted my comments on the above TensorFlow issue. Thank you.

After changing the batch size to 16 and combining all of our data, training went really fast, and on our inference test we got around 32% WER, compared to 30% WER for the DeepSpeech base model (without our training data).
Now we are looking into why the model is not improving. One of my concerns is the language model: while building it from biology-related text we got around 1.1 million distinct words, but the documentation suggested limiting the vocabulary to 500,000 words, so we did. Does that lead to a situation where the decoder may not find the word it is looking for? If so, why is there a limit of 500,000 words? Please help me understand this better.

500 hours in total is not that much, so you might not be able to improve your acoustic model much further. But you are also using unusually long clips.

As for the language (textual) model, use as much text as you can; don't limit it if you don't have to. Ideally, the text you want to recognize already appears in the corpus several times in different combinations.

Check the raw output of the acoustic model and then the final result. Maybe change the parameters for generating the LM so it doesn't prune as much.
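
As a starting point, something along these lines (untested on my side; paths are placeholders, and the exact flags are in data/lm/generate_lm.py in the repo) would keep your full ~1.1 M-word vocabulary and turn pruning off entirely via --arpa_prune "0|0|0":

python3 data/lm/generate_lm.py \
  --input_txt biology_corpus.txt \
  --output_dir . \
  --top_k 1100000 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|0" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie \
  --discount_fallback

You would then rebuild the scorer package from the new LM and vocabulary before re-running the WER test.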

There is no such hard limit; we document that value to:

  • avoid producing an LM that is too big
  • filter out weird words we might get from improperly scraped data (like HTML tag soup in the Wikipedia data)

Please ensure you have properly designed your validation and test sets when training. You might also need to explore different learning rates, dropout values, etc.