Error on training: failed to get device attribute 13 for device 0: CUDA_ERROR_UNKNOWN

Hi,

I’m training a Portuguese model and I’m hitting the error below:

I Saved new best validating model with loss 97.882270 to: <checkpoint_dir>\best_dev-2516
--------------------------------------------------------------------------------
I FINISHED optimization in 0:23:12.610174
2020-08-18 17:42:47.411058: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed to get device attribute 13 for device 0: CUDA_ERROR_UNKNOWN: unknown error
Fatal Python error: Aborted

Thread 0x00001a44 (most recent call first):
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 295 in wait
  File "<...>\appdata\local\programs\python\python36\lib\queue.py", line 164 in get
  File "<...>\virtualenv\<virtualenv>\lib\site-packages\tensorflow_core\python\summary\writer\event_file_writer.py", line 159 in run
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 916 in _bootstrap_inner
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 884 in _bootstrap

Thread 0x00001b0c (most recent call first):
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 295 in wait
  File "<...>\appdata\local\programs\python\python36\lib\queue.py", line 164 in get
  File "<...>\virtualenv\<virtualenv>\lib\site-packages\tensorflow_core\python\summary\writer\event_file_writer.py", line 159 in run
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 916 in _bootstrap_inner
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 884 in _bootstrap

Thread 0x00001a54 (most recent call first):
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 295 in wait
  File "<...>\appdata\local\programs\python\python36\lib\queue.py", line 164 in get
  File "<...>\virtualenv\<virtualenv>\lib\site-packages\tensorflow_core\python\summary\writer\event_file_writer.py", line 159 in run
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 916 in _bootstrap_inner
  File "<...>\appdata\local\programs\python\python36\lib\threading.py", line 884 in _bootstrap

Current thread 0x00004238 (most recent call first):
  File "<...>\virtualenv\<virtualenv>\lib\site-packages\tensorflow_core\python\client\session.py", line 699 in __init__
  File "<...>\virtualenv\<virtualenv>\lib\site-packages\tensorflow_core\python\client\session.py", line 1585 in __init__
  File "<deepspeech_dir>\DeepSpeech\training\mozilla_voice_stt_training\evaluate.py", line 86 in evaluate
  File "<deepspeech_dir>\DeepSpeech\training\mozilla_voice_stt_training\train.py", line 665 in test
  File "<deepspeech_dir>\DeepSpeech\training\mozilla_voice_stt_training\train.py", line 937 in main
  File "<...>\virtualenv\<virtualenv>\lib\site-packages\absl\app.py", line 250 in _run_main
  File "<...>\virtualenv\<virtualenv>\lib\site-packages\absl\app.py", line 299 in run
  File "<deepspeech_dir>\DeepSpeech\training\mozilla_voice_stt_training\train.py", line 961 in run_script
  File "DeepSpeech.py", line 12 in <module>

My execution .bat:

python DeepSpeech.py ^
  --alphabet_config_path D:\Pedro\EtherCity\deepspeech-test\cv-corpus-5.1-2020-06-22\pt\alphabet.txt ^
  --train_files D:\Pedro\EtherCity\deepspeech-test\cv-corpus-5.1-2020-06-22\pt\clips\train-all.csv ^
  --dev_files D:\Pedro\EtherCity\deepspeech-test\cv-corpus-5.1-2020-06-22\pt\clips\dev.csv ^
  --test_files D:\Pedro\EtherCity\deepspeech-test\cv-corpus-5.1-2020-06-22\pt\clips\test.csv ^
  --train_batch_size 80 ^
  --dev_batch_size 80 ^
  --test_batch_size 40 ^
  --n_hidden 375 ^
  --epochs 1 ^
  --early_stop True ^
  --dropout_rate 0.22 ^
  --learning_rate 0.00095 ^
  --report_count 100 ^
  --export_dir D:\Pedro\EtherCity\deepspeech-test\ptModel\results\model_export/ ^
  --checkpoint_dir D:\Pedro\EtherCity\deepspeech-test\ptModel\results\checkpoint

My virtualenv pip freeze:

absl-py==0.9.0
astor==0.8.1
attrdict==2.0.1
certifi==2020.6.20
chardet==3.0.4
gast==0.2.2
google-pasta==0.2.0
grpcio==1.31.0
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
Markdown==3.2.2
mozilla-voice-stt-tflite==0.9.0a6
mvs-ctcdecoder==0.9.0a6
numpy==1.16.0
opt-einsum==3.3.0
pandas==0.25.3
progressbar2==3.47.0
protobuf==3.12.4
python-dateutil==2.8.1
python-utils==2.3.0
pytz==2020.1
pyxdg==0.26
requests==2.24.0
semver==2.10.2
six==1.13.0
sox==1.4.0
tensorboard==1.15.0
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.2
termcolor==1.1.0
urllib3==1.25.10
webrtcvad==2.0.10
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0

I’m using CUDA v10.1 and cudnn-10.1 (cudnn64_7.dll).

How can I check what’s missing here?

Thanks in advance for any kind of support.

We don’t support training on Windows, and you are training with the wrong CUDA version. Please check the docs: TensorFlow r1.15 is documented as using CUDA 10.0.
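
For anyone who wants to catch this kind of mismatch before training, here is a minimal sketch that compares the TensorFlow version found in `pip freeze` output against the CUDA version it is listed with in TensorFlow's tested build configurations. The table is a small hand-copied subset, and the names `TESTED_CUDA`, `tensorflow_version` and `check_cuda` are made up for this example; they are not part of DeepSpeech or TensorFlow.

```python
import re

# Hand-copied subset of TensorFlow's tested GPU build configurations.
# Illustrative only: extend it from the TensorFlow install docs as needed.
TESTED_CUDA = {
    (1, 14): "10.0",
    (1, 15): "10.0",
    (2, 0): "10.0",
    (2, 1): "10.1",
}


def tensorflow_version(pip_freeze_text):
    """Return (major, minor) for tensorflow or tensorflow-gpu, or None if absent."""
    for line in pip_freeze_text.splitlines():
        match = re.match(r"^tensorflow(?:-gpu)?==(\d+)\.(\d+)", line.strip())
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def check_cuda(pip_freeze_text, installed_cuda):
    """Describe whether the installed CUDA version matches the table entry."""
    version = tensorflow_version(pip_freeze_text)
    if version is None:
        return "tensorflow not found in pip freeze output"
    expected = TESTED_CUDA.get(version)
    if expected is None:
        return "no entry for TensorFlow %d.%d in the table" % version
    if expected == installed_cuda:
        return "OK: TensorFlow %d.%d expects CUDA %s" % (version + (expected,))
    return "MISMATCH: TensorFlow %d.%d expects CUDA %s, found %s" % (
        version + (expected, installed_cuda))


if __name__ == "__main__":
    # Three lines taken from the pip freeze output posted above.
    freeze = (
        "tensorboard==1.15.0\n"
        "tensorflow-estimator==1.15.1\n"
        "tensorflow-gpu==1.15.2\n"
    )
    print(check_cuda(freeze, "10.1"))
```

The installed CUDA version is passed in by hand here (for example, read from the output of `nvcc --version`), so the check itself needs neither a GPU nor TensorFlow to run.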

Ah Ok.

I’ll try with CUDA 10.0. And even though Windows training isn’t supported, thanks for the reply.

:slight_smile:

In case it is helpful for anyone, downgrading to CUDA 10.0 got me past the above error.

Once the whole process has finished, I’ll post whether it was successful or not.

Thanks

I successfully generated a model, but it still needs improvement.

Nice!

Good to know it ended up working. If the steps you had to follow differ from our docs, could you document them and send a PR?

Yeah sure!

I had to change the code a little to process the MP3 files, fill in the .csv files by hand, and write a small Python utility to generate the alphabet for me, since training kept failing on some unforeseen characters in the transcripts.

I’ll set aside some time to write that up so we can discuss it.
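
A minimal sketch of what such an alphabet utility could look like, assuming the usual DeepSpeech CSV layout with a `transcript` column. This is not the exact script I used; `build_alphabet` and `format_alphabet` are names made up for this example.

```python
import csv
import io


def build_alphabet(csv_files, column="transcript"):
    """Collect every distinct character found in the given CSV column."""
    chars = set()
    for csv_file in csv_files:
        for row in csv.DictReader(csv_file):
            chars.update(row[column])
    return sorted(chars)


def format_alphabet(chars):
    """Render one character per line, the layout used by alphabet.txt.

    Note: lines starting with '#' are treated as comments in alphabet.txt,
    so a literal '#' label would need extra handling that this sketch omits.
    """
    return "".join(ch + "\n" for ch in chars)


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for the train/dev/test CSVs.
    sample = io.StringIO(
        "wav_filename,wav_filesize,transcript\n"
        "a.wav,1000,olá mundo\n"
        "b.wav,2000,não sei\n"
    )
    print(format_alphabet(build_alphabet([sample])), end="")
```

With real files, each CSV would be opened with `open(path, encoding="utf-8", newline="")` and the handles passed to `build_alphabet` in place of the sample.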

Nice to see that you’re interested.

:slight_smile:

check_characters.py should work, please report issues

are we talking about import_cv2.py? I don’t see why it would not work under Windows, so please report with as much detail as possible

not sure what you mean there