DeepSpeech halts after initialising TensorFlow devices [solved]

Hi there,

I’ve got a weird problem at the moment. I’m guessing there’s a simple solution and I’m just doing something stupid.

I’ve put together the command below to train a DeepSpeech model starting from a set of checkpoints I created in a previous training run.


python3 DeepSpeech/DeepSpeech.py \
 --train_files '/work/cook-island-maori/models/cim_model/train/mi_train.csv' \
 --dev_files '/work/cook-island-maori/models/cim_model/train/mi_dev.csv' \
 --test_files '/work/cook-island-maori/models/cim_model/train/mi_test.csv' \
 --alphabet_config_path '/work/cook-island-maori/models/cim_model/train/alphabet.txt' \
 --lm_binary_path '/work/cook-island-maori/models/cim_model/lm/lm.pointers.binary' \
 --lm_trie_path '/work/cook-island-maori/models/cim_model/lm/pointers.trie' \
 --lm_weight 1.75 \
 --epoch 200 \
 --train_batch_size 16 \
 --dev_batch_size 16 \
 --test_batch_size 16 \
 --learning_rate 0.00005 \
 --max_to_keep 10 \
 --display_step 0 \
 --validation_step 1 \
 --dropout_rate 0.30 \
 --default_stddev 0.046875 \
 --checkpoint_dir /work/cook-island-maori/models/cim_model/checkpoints \
 --decoder_library_path /work/cook-island-maori/models/cim_model/native_client/libctc_decoder_with_kenlm.so \
 --log_level 0 \
 --summary_dir /work/cook-island-maori/models/cim_model/summaries \
 --summary_secs 120 \
 --wer_log_pattern "GLOBAL LOG: logwer(\'cim_model\', %s, %s, %f)" \
 --fulltrace 1 \
 --limit_train 0 \
 --limit_dev 0 \
 --limit_test 0 \
 --valid_word_count_weight 1 \
 --export_dir /work/cook-island-maori/models/cim_model/export \
 --checkpoint_secs 600

This is a loose attempt at transferring the model from one language to another with a similar alphabet.

Running the command produces the following logs:

2018-11-02 04:07:05.468907: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-02 04:07:05.558123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-02 04:07:05.558569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-02 04:07:05.558607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-02 04:07:05.865882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-02 04:07:05.865959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-02 04:07:05.865974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-02 04:07:05.866288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-11-02 04:07:08.899032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-02 04:07:08.899122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-02 04:07:08.899136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-02 04:07:08.899145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-02 04:07:08.899293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

After this point it just idles and stops doing anything. When I check, CPU usage sits at 0 and so does the I/O. I’m a little lost at the moment, so I’ll probably take a break from the problem for a while.

If I come up with a solution, I’ll try to remember to write it up here later.

OK, I figured it out: some of the WAV files listed in the training CSV were missing from disk. Replacing them fixed the issue, and adding logging to DeepSpeech.py is what helped me identify the problem.
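
In case it helps anyone else: before kicking off training, it’s worth sanity-checking that every file referenced in the CSVs actually exists. Here’s a minimal sketch of the kind of check I mean (it’s not part of DeepSpeech itself; it just assumes the standard wav_filename column that DeepSpeech CSVs use):

import csv
import os
import sys

def missing_wavs(csv_path):
    # Return every wav_filename entry in a DeepSpeech-style CSV
    # that doesn't exist on disk.
    missing = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            wav = row['wav_filename']
            if not os.path.isfile(wav):
                missing.append(wav)
    return missing

if __name__ == '__main__':
    # e.g. python3 check_wavs.py mi_train.csv mi_dev.csv mi_test.csv
    for path in sys.argv[1:]:
        for wav in missing_wavs(path):
            print('MISSING: {} (listed in {})'.format(wav, path))

Running that over the train/dev/test CSVs would have caught this straight away.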

I think I wasn’t getting any errors pointing at this because the file loading happens in worker threads, so the exceptions were being swallowed silently…
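
For reference, the logging I added was essentially a try/except around the per-file preprocessing call so that exceptions raised in the worker threads actually get reported instead of vanishing. Roughly like this (safe_process and process_fn are my own names for illustration, not DeepSpeech APIs):

import logging
import traceback

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger('preprocess')

def safe_process(wav_path, process_fn):
    # process_fn stands in for whatever per-file work your version of
    # DeepSpeech.py does when it reads a sample (hypothetical wrapper).
    try:
        return process_fn(wav_path)
    except Exception:
        log.error('Failed to process %s\n%s', wav_path, traceback.format_exc())
        raise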