Error on starting training inside Docker container for DeepSpeech 0.9.1 using GPU

I ran the following command and it seems to be some cuDNN issue, which is strange since I used the Dockerfile.train provided as is. Am I missing something here?
python3 DeepSpeech.py \
--alphabet_config_path data/alphabet.txt \
--beam_width 32 \
--checkpoint_dir $ckpt_dir \
--export_dir $ckpt_dir \
--scorer $scorer_path \
--n_hidden 128 \
--learning_rate 0.0001 \
--lm_alpha 0.75 \
--lm_beta 1.85 \
--train_batch_size 24 \
--dev_batch_size 24 \
--test_batch_size 2 \
--report_count 10 \
--epochs 500 \
--noearly_stop \
--noshow_progressbar \
--export_tflite \
--train_files /datasets/deepspeech_wakeword_dataset/wakeword-train.csv,\
/datasets/deepspeech_wakeword_dataset/wakeword-train-other-accents.csv,\
/datasets/deepspeech_wakeword_dataset/wakeword-train.csv,\
/datasets/india_portal_2may2019-train.csv,\
/datasets/india_portal_2to9may2019-train.csv,\
/datasets/india_portal_9to19may2019-train.csv,\
/datasets/india_portal_19to24may2019-train.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-train.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-train.csv,\
/datasets/japan_portal_3july2019-wakeword-train.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-train.csv,\
/datasets/alexa-train.csv,\
/datasets/alexa-polly-train.csv,\
/datasets/alexa-sns.csv,\
/datasets/india_portal_ww_data_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_05222020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_test.csv,\
/datasets/ww_gtts_data_google_siri/custom_train.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_train.csv,\
/datasets/ww_polly_data_google_siri/custom_test.csv \
--dev_files /datasets/deepspeech_wakeword_dataset/wakeword-dev.csv,\
/datasets/india_portal_2may2019-dev.csv,\
/datasets/india_portal_2to9may2019-dev.csv,\
/datasets/india_portal_9to19may2019-dev.csv,\
/datasets/india_portal_19to24may2019-dev.csv,\
/datasets/brazil_portal_20to26june2019-wakeword-dev.csv,\
/datasets/brazil_portal_26juneto3july2019-wakeword-dev.csv,\
/datasets/mixed_portal_backups_14_16_17_18_19_visteon_wakeword_dataset-dev.csv,\
/datasets/alexa-dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_05222020/custom_dev.csv,\
/datasets/ww_gtts_data_google_siri/custom_dev.csv,\
/datasets/ww_polly_data_google_siri/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv \
--test_files /datasets/ww_test_aggregated.csv,\
/datasets/alexa-train.csv,\
/datasets/alexa-polly-train.csv,\
/datasets/alexa-sns.csv,\
/datasets/alexa-dev.csv,\
/datasets/india_portal_ww_data_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_05042020/custom_test.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_04282020/custom_test.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_train.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_dev.csv,\
/datasets/india_portal_ww_data_augmented_05042020/custom_test.csv
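
Before the log below, a useful first check is whether TensorFlow can see the GPU inside the container at all. A minimal sketch, assuming the TF 1.15 build that ships with the 0.9.1 training image (the script name is hypothetical, not part of DeepSpeech):

    # gpu_check.py -- hypothetical sanity check, assuming TF 1.15.
    import tensorflow as tf

    # True only if a CUDA device is visible to TensorFlow and can be
    # initialized; False here points at a driver/runtime problem rather
    # than anything in the training command above.
    print(tf.test.is_gpu_available(cuda_only=False))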

    I Could not find best validating checkpoint.
    I Could not find most recent checkpoint.
    I Initializing all variables.
    I STARTING Optimization
    I Training epoch 0...
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
        return fn(*args)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
        target_list, run_metadata)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
        run_metadata)
    tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
      (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
    	 [[{{node tower_0/conv1d}}]]
    	 [[concat/concat/_99]]
      (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
    	 [[{{node tower_0/conv1d}}]]
    0 successful operations.
    0 derived errors ignored.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "DeepSpeech.py", line 12, in <module>
        ds_train.run_script()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 976, in run_script
        absl.app.run(main)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
        sys.exit(main(argv))
      File "/DeepSpeech/training/deepspeech_training/train.py", line 948, in main
        train()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 605, in train
        train_loss, _ = run_set('train', epoch, train_init_op)
      File "/DeepSpeech/training/deepspeech_training/train.py", line 570, in run_set
        feed_dict=feed_dict)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
        run_metadata_ptr)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
        feed_dict_tensor, options, run_metadata)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
        run_metadata)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
      (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
    	 [[node tower_0/conv1d (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
    	 [[concat/concat/_99]]
      (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
    	 [[node tower_0/conv1d (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
    0 successful operations.
    0 derived errors ignored.

    Original stack trace for 'tower_0/conv1d':
      File "DeepSpeech.py", line 12, in <module>
        ds_train.run_script()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 976, in run_script
        absl.app.run(main)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
        sys.exit(main(argv))
      File "/DeepSpeech/training/deepspeech_training/train.py", line 948, in main
        train()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 483, in train
        gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
      File "/DeepSpeech/training/deepspeech_training/train.py", line 316, in get_tower_results
        avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
      File "/DeepSpeech/training/deepspeech_training/train.py", line 243, in calculate_mean_edit_distance_and_loss
        logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
      File "/DeepSpeech/training/deepspeech_training/train.py", line 171, in create_model
        batch_x = create_overlapping_windows(batch_x)
      File "/DeepSpeech/training/deepspeech_training/train.py", line 69, in create_overlapping_windows
        batch_x = tf.nn.conv1d(input=batch_x, filters=eye_filter, stride=1, padding='SAME')
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1681, in conv1d
        name=name)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
        data_format=data_format, dilations=dilations, name=name)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
        op_def=op_def)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
        attrs, op_def, compute_device)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
        op_def=op_def)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
        self._traceback = tf_stack.extract_stack()
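
The failing node, tower_0/conv1d, comes from create_overlapping_windows (visible at the bottom of the stack trace), so cuDNN is failing to initialize for even a trivial convolution rather than anything model-specific. A standalone repro sketch of that op under the same assumed TF 1.15 environment (shapes are illustrative, not DeepSpeech's actual ones):

    # conv_repro.py -- hypothetical standalone repro of the failing op.
    import numpy as np
    import tensorflow as tf

    x = tf.constant(np.ones((1, 16, 26), dtype=np.float32))  # [batch, time, features]
    f = tf.constant(np.ones((3, 26, 26), dtype=np.float32))  # [width, in_ch, out_ch]
    y = tf.nn.conv1d(input=x, filters=f, stride=1, padding='SAME')

    with tf.Session() as sess:
        # On an affected setup this raises the same UnknownError:
        # "Failed to get convolution algorithm ... cuDNN failed to initialize".
        print(sess.run(y).shape)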

Yes: context on your setup.

@lissyx could you please elaborate on that? I haven’t taken any additional steps because I thought that everything was already set up in the Docker image.

Hardware, OS, stack, etc.

@lissyx I’m using a Dell XPS 15 9570, which is running Ubuntu 18.04 and has an NVIDIA® GeForce™ GTX 1050 Ti GPU.

GPU mem? Dataset size? When do you hit this error?

Can’t you just be explicit at once?

@lissyx GPU memory is 4042 MiB. The dataset size is 3 hours, and the error occurs right at the beginning of training. I have tried reducing the training and dev batch sizes to as small as 2 to make sure it was not running out of memory, but I still encounter the same error.

4 GB is likely not enough.

@lissyx then how come, when I run the ‘docker build’ command, the training in ‘./bin/run-ldc93s1.sh’ runs successfully, but now that I have started a docker container, even ‘./bin/run-ldc93s1.sh’ does not run?

    export TF_FORCE_GPU_ALLOW_GROWTH=true

If you do this, it works… this is in the documentation.
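
For context on why this works: TF_FORCE_GPU_ALLOW_GROWTH=true makes TensorFlow allocate GPU memory on demand instead of reserving nearly the whole card at startup, which on a 4 GB GPU can leave cuDNN without enough workspace to initialize. A minimal sketch of the in-code equivalent for plain TF 1.15 sessions (an illustration, not how DeepSpeech itself wires it up):

    # allow_growth.py -- hypothetical illustration of the env var's effect.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Grow the per-process GPU memory pool as needed instead of
    # pre-allocating almost all of it up front.
    config.gpu_options.allow_growth = True

    with tf.Session(config=config) as sess:
        pass  # build and run the graph here

For the training script itself, exporting the variable in the shell before running python3 DeepSpeech.py is enough, as noted above.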
