DeepSpeech halts after initialising TensorFlow devices [solved]

Hi there,

I’ve got a weird problem at the moment. I’m guessing there’s a simple solution and I’m just doing something stupid.

I’ve put together the command below to train a DeepSpeech model starting from a set of checkpoints I created in a previous training run.


python3 DeepSpeech/DeepSpeech.py \
 --train_files '/work/cook-island-maori/models/cim_model/train/mi_train.csv' \
 --dev_files '/work/cook-island-maori/models/cim_model/train/mi_dev.csv' \
 --test_files '/work/cook-island-maori/models/cim_model/train/mi_test.csv' \
 --alphabet_config_path '/work/cook-island-maori/models/cim_model/train/alphabet.txt' \
 --lm_binary_path '/work/cook-island-maori/models/cim_model/lm/lm.pointers.binary' \
 --lm_trie_path '/work/cook-island-maori/models/cim_model/lm/pointers.trie' \
 --lm_weight 1.75 \
 --epoch 200 \
 --train_batch_size 16 \
 --dev_batch_size 16 \
 --test_batch_size 16 \
 --learning_rate 0.00005 \
 --max_to_keep 10 \
 --display_step 0 \
 --validation_step 1 \
 --dropout_rate 0.30 \
 --default_stddev 0.046875 \
 --checkpoint_dir /work/cook-island-maori/models/cim_model/checkpoints \
 --decoder_library_path /work/cook-island-maori/models/cim_model/native_client/libctc_decoder_with_kenlm.so \
 --log_level 0 \
 --summary_dir /work/cook-island-maori/models/cim_model/summaries \
 --summary_secs 120 \
 --wer_log_pattern "GLOBAL LOG: logwer(\'cim_model\', %s, %s, %f)" \
 --fulltrace 1 \
 --limit_train 0 \
 --limit_dev 0 \
 --limit_test 0 \
 --valid_word_count_weight 1 \
 --export_dir /work/cook-island-maori/models/cim_model/export \
 --checkpoint_secs 600

This is a loose attempt at transferring the model from one language to another with a similar alphabet.

Running the command produces the following logs:

2018-11-02 04:07:05.468907: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-02 04:07:05.558123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-02 04:07:05.558569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-02 04:07:05.558607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-02 04:07:05.865882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-02 04:07:05.865959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-02 04:07:05.865974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-02 04:07:05.866288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-11-02 04:07:08.899032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-02 04:07:08.899122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-02 04:07:08.899136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-02 04:07:08.899145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-02 04:07:08.899293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

After this point it just idles and stops doing anything. When I check, CPU usage sits at 0 and so does the I/O. I’m a little lost at the moment, so I’ll probably take a break from the problem for a while.

If I come up with a solution, I’ll try to remember to write it up here later.

OK, I figured it out: some of the WAV files listed in the training CSV were missing from disk. Replacing them fixed the issue, and adding logging to DeepSpeech.py is what helped me identify the problem.
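
In case it helps anyone else: before kicking off training, it’s worth sanity-checking that every file referenced in the CSVs actually exists. Here’s a minimal sketch of the kind of check I mean (it’s not part of DeepSpeech itself; it just assumes the standard wav_filename column that DeepSpeech CSVs use):

import csv
import os
import sys

def missing_wavs(csv_path):
    # Return every wav_filename entry in a DeepSpeech-style CSV
    # that doesn't exist on disk.
    missing = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            wav = row['wav_filename']
            if not os.path.isfile(wav):
                missing.append(wav)
    return missing

if __name__ == '__main__':
    # e.g. python3 check_wavs.py mi_train.csv mi_dev.csv mi_test.csv
    for path in sys.argv[1:]:
        for wav in missing_wavs(path):
            print('MISSING: {} (listed in {})'.format(wav, path))

Running that over the train/dev/test CSVs would have caught this straight away.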

I think I wasn’t getting any errors pointing at this because the file loading happens in worker threads, so the exceptions were being swallowed silently…
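
For reference, the logging I added was essentially a try/except around the per-file preprocessing call so that exceptions raised in the worker threads actually get reported instead of vanishing. Roughly like this (safe_process and process_fn are my own names for illustration, not DeepSpeech APIs):

import logging
import traceback

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger('preprocess')

def safe_process(wav_path, process_fn):
    # process_fn stands in for whatever per-file work your version of
    # DeepSpeech.py does when it reads a sample (hypothetical wrapper).
    try:
        return process_fn(wav_path)
    except Exception:
        log.error('Failed to process %s\n%s', wav_path, traceback.format_exc())
        raise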