Fine-tuning failing on custom dataset

I am on version 0.7.0 with CUDA 10.0.130, cuDNN 7.6.5, and NVIDIA driver 430.64, running on two GTX 1070 GPUs with 8 GB of memory each.
I was able to fine tune on the voxforge dataset without any issues.

Now I am trying to train on a custom dataset. I have created CSVs similar to the LibriSpeech ones (based on the bin/import*.py scripts), with each line containing the file name, file size, and transcript; each wav is a mono, 16 kHz clip of less than 15 seconds, with 95 hours for train, 2.5 for dev, and 2.5 for test. I removed all special characters, punctuation, and numbers from the transcripts and Unicode-normalized them (I confirmed that my transcripts contain only a-z and ').
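For reference, this is roughly the kind of transcript cleanup I did (a minimal sketch; clean_transcript is my own helper name, not something from the import scripts):

import re
import unicodedata

# keep only lowercase letters, the apostrophe, and spaces
ALLOWED = re.compile(r"[^a-z' ]+")

def clean_transcript(text):
    # decompose accented characters, then drop anything non-ASCII
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('ascii').lower()
    # replace digits/punctuation with spaces and collapse whitespace
    text = ALLOWED.sub(' ', text)
    return ' '.join(text.split())

print(clean_transcript("Héllo, World! 123 it's"))  # -> hello world it's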

When I start training, I see this error:

python DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.7.0-checkpoint  --epochs 100 --train_files data/youtube_train100.csv --dev_files data/youtube_dev100.csv --learning_rate 0.000001 --scorer_path models/deepspeech-0.7.0-models.scorer --train_cudnn  --use_allow_growth --train_batch_size 32 --dev_batch_size 32 --es_epochs 10 --early_stop True --export_dir youtubemodel --save_checkpoint_dir youtubemodel
I Loading best validating checkpoint from deepspeech-0.7.0-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:08 | Steps: 7 | Loss: 78.528744
Traceback (most recent call last):
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048] 
	 [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
	 [[tower_0/gradients/tower_0/MatMul_4_grad/tuple/control_dependency_1/_113]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048] 
	 [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations.
1 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 939, in run_script
    absl.app.run(main)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 911, in main
    train()
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 589, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 549, in run_set
    feed_dict=feed_dict)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048] 
	 [[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[tower_0/gradients/tower_0/MatMul_4_grad/tuple/control_dependency_1/_113]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048] 
	 [[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
1 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 939, in run_script
    absl.app.run(main)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 911, in main
    train()
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 475, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 313, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 191, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Since I was able to successfully fine-tune the released 0.7.0 model on the VoxForge dataset (using the import_voxforge.py script to download and preprocess the audio files), I assume my environment is set up correctly. I used a training batch size of 32 for the 115-hour voxforge_train.csv. I am also able to run the evaluate and transcribe scripts on my dataset with the pretrained model.

Does this error still indicate any issue with the dataset?

I also tried reducing the batch size all the way down to 4, and it still runs into the same error. This is the GPU memory usage when the error occurs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
| N/A   56C    P8    11W /  N/A |   3532MiB /  8085MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   56C    P8     6W /  N/A |   1455MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1177      G   /usr/lib/xorg/Xorg                           843MiB |
|    0      2171      G   compiz                                        62MiB |
|    0      3421      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files   225MiB |
|    0      7593      C   python                                      2395MiB |
|    1      7593      C   python                                      1441MiB |

Any help on this is appreciated. Does this still indicate an issue with my dataset, particularly the transcript column, or something else with my setup?

Thanks

I'm not an expert; maybe @reuben or @lissyx have an idea. Maybe try less data to check your setup.

Also, according to the docs you should use Python 3.6, but I don't know whether that is the cause of the error.

@othiele Thanks for the response.
Yes, I did notice that the supported version is Python 3.6, but since I did not have any issues training on VoxForge, I did not consider upgrading the Python version.

Previously, each training (and dev and test) item was a wav file between 3 and 15 seconds, with ~80% of them over 10 seconds. On the VoxForge train set I noticed that the majority are between 3 and 7 seconds. After I remodeled my dataset in this fashion (~80% of the items between 3 and 7 seconds), I got past the "Failed to call ThenRnnForward with model config" error and am able to train. I noticed that LibriSpeech clean train-100 has a different distribution (the majority being around 10-12 seconds).
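For reference, here is roughly how I estimated clip durations from the file sizes in the CSV (a minimal sketch; it assumes the standard wav_filename,wav_filesize,transcript header and 16 kHz, 16-bit mono PCM wavs, and duration_sec is my own helper):

import csv

BYTES_PER_SEC = 16000 * 2   # 16 kHz mono, 16-bit PCM
HEADER_BYTES = 44           # standard RIFF/WAVE header size

def duration_sec(filesize):
    return max(0, int(filesize) - HEADER_BYTES) / float(BYTES_PER_SEC)

with open('data/youtube_train100.csv') as f:
    durs = [duration_sec(row['wav_filesize']) for row in csv.DictReader(f)]

print('%d clips, %.1f h total' % (len(durs), sum(durs) / 3600))
print('share of 3-7 s clips: %.0f%%' % (100.0 * sum(1 for d in durs if 3 <= d <= 7) / len(durs)))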

So my next question: is there an optimal training wav file duration for achieving a lower WER? I have seen posts and other notes saying each wav file should be less than 15 seconds, but no recommendation on the distribution. Is there any correlation between WER and the length of the individual training wav files, or is this just a function of what your hardware can handle?
Hope I am making sense here

Thanks, and I appreciate your input.

OK, so you probably had an out-of-memory error.

It will be interesting to hear from @lissyx and @reuben, but I would argue that it is not the length (above a certain minimum) but rather the quality, the spoken accent, and most importantly the total amount of data. Get a lot of data in the same style as what you want to recognize later; that helps.

Thanks @othiele

@lissyx and @reuben: can you please shed some light on this?
My question here is about the correlation between the duration of each training wav file and WER.
While < 15 seconds is the norm I have seen, LibriSpeech has most wav files around 10-12 seconds and VoxForge mostly between 3-5 seconds (this is from plotting the duration distribution of the train csv, estimated from the file sizes). Is there a recommended length for the lowest WER in your experience, or, as @othiele noted in his previous reply, is this entirely dependent on the sound quality, the accent, and the availability of a large training dataset?

Also, does training on short wav files (3-5 seconds) mean that the inference engine would give a lower WER on similarly short test wavs compared to 15-20 second ones? Hope I am making sense here.

Thanks for your time.

Those errors could be a manifestation of an upstream TensorFlow issue that we finally identified and fixed. The TensorFlow r1.15 fix was released a few days ago, so installing 1.15.4 should get rid of them.
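For anyone else hitting this, the upgrade inside the training virtualenv should look something like the following (assuming the GPU build of TensorFlow):

pip3 install --upgrade 'tensorflow-gpu==1.15.4'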