I am on DeepSpeech 0.7.0 with CUDA 10.0.130, cuDNN 7.6.5, and NVIDIA driver 430.64, running on two GTX 1070 GPUs with 8 GB of memory each.
I was able to fine-tune on the VoxForge dataset without any issues.
Now I am trying to train on a custom dataset. I have created a CSV similar to the LibriSpeech one (based on the bin/import*.py scripts), with each line containing the file name, the file size, and the transcript. Each wav is a mono, 16 kHz clip of less than 15 seconds, and I have 95 hours for train, 2.5 for dev, and 2.5 for test. I removed all special characters, punctuation, and numbers from the transcripts and applied Unicode normalization (I confirmed that the transcripts contain only a-z and ').
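For reference, this is roughly how I validated the rows; a minimal sketch (not my exact script), assuming the LibriSpeech-style header wav_filename,wav_filesize,transcript:

import csv
import re
import wave

# transcripts may only contain a-z, apostrophes, and spaces
ALLOWED = re.compile(r"^[a-z' ]+$")

with open("data/youtube_train100.csv", newline="") as f:
    for row in csv.DictReader(f):
        if not ALLOWED.match(row["transcript"]):
            print("bad transcript:", row["wav_filename"])
        # clips must be mono, 16 kHz, and shorter than 15 seconds
        with wave.open(row["wav_filename"], "rb") as w:
            duration = w.getnframes() / w.getframerate()
            if w.getnchannels() != 1 or w.getframerate() != 16000 or duration >= 15:
                print("bad audio:", row["wav_filename"],
                      w.getnchannels(), w.getframerate(), round(duration, 2))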
When I start training with the following command, I see this error:
python DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.7.0-checkpoint --epochs 100 --train_files data/youtube_train100.csv --dev_files data/youtube_dev100.csv --learning_rate 0.000001 --scorer_path models/deepspeech-0.7.0-models.scorer --train_cudnn --use_allow_growth --train_batch_size 32 --dev_batch_size 32 --es_epochs 10 --early_stop True --export_dir youtubemodel --save_checkpoint_dir youtubemodel
I Loading best validating checkpoint from deepspeech-0.7.0-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:08 | Steps: 7 | Loss: 78.528744 Traceback (most recent call last):
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
[[tower_0/gradients/tower_0/MatMul_4_grad/tuple/control_dependency_1/_113]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations.
1 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 939, in run_script
absl.app.run(main)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 911, in main
train()
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 589, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 549, in run_set
feed_dict=feed_dict)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/gradients/tower_0/MatMul_4_grad/tuple/control_dependency_1/_113]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 250, 32, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
1 derived errors ignored.
Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3':
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 939, in run_script
absl.app.run(main)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 911, in main
train()
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 475, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 313, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 191, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File "/home/Documents/DeepSpeech/training/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
return converted_call(f, options, args, kwargs)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
return f(*args, **kwargs)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
training)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
seed=self._seed)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
time_major=time_major, name=name)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/home/Documents/DeepSpeech/ds_venv/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Since I was able to successfully fine-tune the released 0.7.0 model on the VoxForge dataset (using the import_voxforge.py script to download and preprocess the audio files), I am assuming my environment is set up correctly. I used a training batch size of 32 for the 115-hour voxforge_train.csv. I am also able to run the evaluate and transcribe scripts on my dataset with the pretrained model.
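For example, an evaluation run along these lines completed without errors (flags quoted from memory; data/youtube_test100.csv is my test CSV, named by the same convention as the train/dev files):

python evaluate.py --test_files data/youtube_test100.csv --test_batch_size 32 --checkpoint_dir deepspeech-0.7.0-checkpoint --scorer_path models/deepspeech-0.7.0-models.scorer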
Does this error still indicate any issue with the dataset?
I also tried reducing the batch size all the way down to 4, and it still runs into the same error. This is the GPU memory usage when the error occurs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| N/A 56C P8 11W / N/A | 3532MiB / 8085MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1070 Off | 00000000:02:00.0 Off | N/A |
| N/A 56C P8 6W / N/A | 1455MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1177 G /usr/lib/xorg/Xorg 843MiB |
| 0 2171 G compiz 62MiB |
| 0 3421 G ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files 225MiB |
| 0 7593 C python 2395MiB |
| 1 7593 C python 1441MiB |
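One thing I still plan to try, in case the cuDNN RNN workspace is running out of memory despite --use_allow_growth, is pinning training to the headless GPU and forcing allocator growth through the environment as well (a sketch; as far as I know, TF 1.15 reads TF_FORCE_GPU_ALLOW_GROWTH):

TF_FORCE_GPU_ALLOW_GROWTH=true CUDA_VISIBLE_DEVICES=1 python DeepSpeech.py --train_cudnn --train_batch_size 4 ...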
Any help on this is appreciated. Does this still indicate an issue with my dataset, particularly the transcript column, or is it something else with my setup?
Thanks