Is DeepSpeech not meant for one-word audio files?

@lissyx I'm working on fine-tuning and running into an error. I suspect OOM, but I'm not sure.

CUDA and cuDNN info:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
root@d7284da3dc5c:/DeepSpeech# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5

Fine-tuning bash script:

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi

# Pin everything to GPU 0
export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
#export TF_FORCE_GPU_ALLOW_GROWTH=true

python3 -u DeepSpeech.py \
  --train_files //tmp/external/google_cmds_csvs/train.csv \
  --test_files  //tmp/external/google_cmds_csvs/test.csv \
  --dev_files  //tmp/external/google_cmds_csvs/dev.csv \
  --epochs 5 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ \
  --use_allow_growth  \
  --n_hidden 2048 \
  --train_cudnn  \
  --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/  \
  "$@"

I imagine this error means my GPU is running out of memory? It probably can't handle n_hidden 2048, even with all batch sizes set to 1… (a quick way to check is sketched after the log below).

I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 2 | Loss: 25.073557
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 3 | Loss: 22.607900
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 4 | Loss: 19.332927
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 5 | Loss: 18.299621
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 6 | Loss: 22.737631
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 7 | Loss: 21.567622
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 8 | Loss: 23.387867
2020-07-30 15:45:48.370974: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2020-07-30 15:45:48.371022: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:830] Non-OK-status: GpuLaunchKernel(ColumnReduceKernel<IN_T, OUT_T, Op>, grid_dim, block_dim, 0, cu_stream, in, out, extent_x, extent_y, op, init) status: Internal: unknown error
Fatal Python error: Aborted

Thread 0x00007f5199679700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/pyAborted

Sorry, this is the first time I've seen that error…

I tried adding my own --learning_rate 0.0001.

I think I've seen this error before in other Discourse topics, but I'm not sure what it means.

Error:

+ [ ! -f DeepSpeech.py ]
+ export NVIDIA_VISIBLE_DEVICES=0
+ export CUDA_VISIBLE_DEVICES=0
+ python3 -u DeepSpeech.py --train_files //tmp/external/google_cmds_csvs/train.csv --test_files //tmp/external/google_cmds_csvs/test.csv --dev_files //tmp/external/google_cmds_csvs/dev.csv --epochs 5 --train_batch_size 1 --dev_batch_size 1 --test_batch_size 1 --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ --use_allow_growth --n_hidden 2048 --train_cudnn --learning_rate 0.0001 --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/
I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 2 | Loss: 23.942448
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
         [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 595, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 560, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 321, in get_tower_results
    gradients = optimizer.compute_gradients(avg_loss)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 350, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/cudnn_rnn_grad.py", line 104, in _cudnn_rnn_backwardv3
    direction=op.get_attr("direction")) + (None,)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 749, in cudnn_rnn_backprop_v3
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'tower_0/cudnn_lstm/CudnnRNNV3', defined at:
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
[elided 4 identical lines from previous traceback]
  File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 312, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 190, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 128, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)

So, you might be hitting https://github.com/tensorflow/tensorflow/issues/41630
Please try with the TF_CUDNN_RESET_RND_GEN_STATE=1 env var.
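
For example, right before the python3 call in your run script (just a sketch of where it would go):

# Work around the cuDNN RNN state reuse bug from the TF issue above
export TF_CUDNN_RESET_RND_GEN_STATE=1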

It can also be just a symptom of GPU OOM.

Yeah, I retried with that env var and still hit the same error; it stops at about the 7th step with batch size 1. Is there a large amount of memory required to run cuDNN with n_hidden 2048?

If so, I may have to work on getting more GPU power before I can successfully fine-tune.
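
For a rough sense of scale, here's my back-of-the-envelope math (assuming a single unidirectional LSTM layer, input size equal to n_hidden, and cuDNN's default two bias vectors per gate):

# 4 gates x (input weights + recurrent weights + 2 biases) at n_hidden 2048
echo $(( 4 * (2048*2048 + 2048*2048 + 2*2048) ))   # 33570816, ~33.6M params
# Adam keeps two extra slots per parameter, so roughly 100M floats,
# ~400 MB in fp32, before dense layers, activations, or cuDNN workspace.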

Try reducing the network size; you'll know quite fast.

Yep. I tried training with --n_hidden 100 and --train_cudnn, and it was able to train for a while. Maybe my GPU just can't handle cuDNN at this size; my GPU usage skyrockets whenever I use that flag.
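
For reference, the smaller test was something like this (same paths as my script above; the checkpoint dir is just a placeholder, since the 2048-wide release checkpoint can't load into a 100-wide net and needs a fresh directory):

python3 -u DeepSpeech.py \
  --train_files //tmp/external/google_cmds_csvs/train.csv \
  --dev_files //tmp/external/google_cmds_csvs/dev.csv \
  --test_files //tmp/external/google_cmds_csvs/test.csv \
  --epochs 1 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --train_cudnn \
  --checkpoint_dir //tmp/external/small_test_chkpt/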