Segmentation fault

Dear Support,

I am facing the following error during training.

I Initializing variables…
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:37:58 | Steps: 3796 | Loss: 57.462339 Fatal Python error: Segmentation fault

Thread 0x00007f6ec7944700 (most recent call first):
File "/home/user/anaconda3/envs/zml/lib/python3.6/threading.py", line 295 in wait
File "/home/user/anaconda3/envs/zml/lib/python3.6/queue.py", line 164 in get
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/user/anaconda3/envs/zml/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/user/anaconda3/envs/zml/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f7012e17700 (most recent call first):
File "/home/user/anaconda3/envs/zml/lib/python3.6/threading.py", line 295 in wait
File "/home/user/anaconda3/envs/zml/lib/python3.6/queue.py", line 164 in get
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/user/anaconda3/envs/zml/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/user/anaconda3/envs/zml/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f70e9fad740 (most recent call first):
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429 in _call_tf_sessionrun
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341 in _run_fn
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356 in _do_call
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350 in _do_run
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173 in _run
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950 in run
File "DeepSpeech.py", line 599 in run_set
File "DeepSpeech.py", line 631 in train
File "DeepSpeech.py", line 938 in main
File "/home/user/.local/lib/python3.6/site-packages/absl/app.py", line 250 in _run_main
File "/home/user/.local/lib/python3.6/site-packages/absl/app.py", line 299 in run
File "DeepSpeech.py", line 965 in <module>
./run_training.sh: line 44: 13250 Segmentation fault (core dumped) python3 -u DeepSpeech.py

This happens at a checkpoint or at a change of epoch (I guess).

What might be the reason? My setup is:

TensorFlow: v1.14.0-21-ge77504a
DeepSpeech: v0.6.0-alpha.15-0-gb3787ee

Any guidance would be appreciated.

Regards.

It isn't: the stack shows it comes from a training step. I don't know what's causing this, but I would start by not using Anaconda and seeing whether that fixes it.
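For example, a minimal sketch of how to try that, assuming a plain virtualenv and the versions listed above (the path and package versions are illustrative, not prescriptive):

# Create a plain virtualenv instead of the Anaconda environment (path is hypothetical)
python3 -m venv ~/deepspeech-venv
source ~/deepspeech-venv/bin/activate

# Reinstall the training dependencies inside the new environment
pip install --upgrade pip
pip install tensorflow-gpu==1.14.0
pip install -r requirements.txt   # from the DeepSpeech checkout

# Re-run the same training script and see whether the segfault persists
./run_training.sh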

I am getting a similar error when running this test:

python evaluate.py --test_files data/CV/ur/clips/test.csv

I Restored variables from best validation checkpoint at /home/hashim/.local/share/deepspeech/checkpoints/best_dev-245252, step 245252
Testing model on data/CV/ur/clips/test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00 2019-12-19 16:46:39.923201: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2019-12-19 16:46:39.961734: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-12-19 16:46:40.079803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
Fatal Python error: Segmentation fault

Thread 0x00007f1cc68e4740 (most recent call first):
File "/home/hashim/Desktop/Hashim/UrduCorpus/deepspeech-venv-0.6.0/lib/python3.7/site-packages/ds_ctcdecoder/__init__.py", line 116 in ctc_beam_search_decoder_batch
File "evaluate.py", line 122 in run_test
File "evaluate.py", line 155 in evaluate
File "evaluate.py", line 168 in main
File "/home/hashim/Desktop/Hashim/UrduCorpus/deepspeech-venv-0.6.0/lib/python3.7/site-packages/absl/app.py", line 250 in _run_main
File "/home/hashim/Desktop/Hashim/UrduCorpus/deepspeech-venv-0.6.0/lib/python3.7/site-packages/absl/app.py", line 299 in run
File "evaluate.py", line 177 in <module>
Segmentation fault (core dumped)

@reuben I am continuing training from the released model, using my own data ([train,dev,test].csv) and building a new LM as described in v0.6.0.

Training went well and the model was exported, but when I test it, I get the above error.

That is not the same error. In your case, it happens in the ds_ctcdecoder package. Can you share the output of pip list | grep ctcdecoder?

pip list|grep ctcdecoder
ds-ctcdecoder 0.6.0

I also tried testing with:

deepspeech --model export/output_graph.pb --lm data/lm/lm.binary --trie data/lm/trie --audio data/CV/ur/clips/026-26-muha.wav

but I am getting the following similar error:

deepspeech --model export/output_graph.pb --lm data/lm/lm.binary --trie data/lm/trie --audio data/CV/ur/clips/026-26-muha.wav
Loading model from file export/output_graph.pb
TensorFlow: v1.14.0-21-ge77504a
DeepSpeech: v0.6.0-0-g6d43e21
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-12-19 17:05:39.753428: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-19 17:05:39.754553: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-12-19 17:05:39.781038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-19 17:05:39.781330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
2019-12-19 17:05:39.781338: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-12-19 17:05:39.781378: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-19 17:05:39.781625: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-19 17:05:39.781857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-12-19 17:06:30.447586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-19 17:06:30.447611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-12-19 17:06:30.447616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-12-19 17:06:30.468836: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-19 17:06:30.469127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-19 17:06:30.469367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-19 17:06:30.469602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5585 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
Loaded model in 51.0s.
Loading language model from files data/lm/lm.binary data/lm/trie
Loaded language model in 0.0106s.
Running inference.
2019-12-19 17:06:30.797118: W tensorflow/core/framework/allocator.cc:107] Allocation of 134217728 exceeds 10% of system memory.
2019-12-19 17:06:30.876340: W tensorflow/core/framework/allocator.cc:107] Allocation of 134217728 exceeds 10% of system memory.
2019-12-19 17:06:31.079532: W tensorflow/core/framework/allocator.cc:107] Allocation of 134217728 exceeds 10% of system memory.
2019-12-19 17:06:31.139260: W tensorflow/core/framework/allocator.cc:107] Allocation of 134217728 exceeds 10% of system memory.
2019-12-19 17:06:31.198683: W tensorflow/core/framework/allocator.cc:107] Allocation of 134217728 exceeds 10% of system memory.
Segmentation fault (core dumped)

That's both good news and bad news. Good news: it's not the Python package itself; bad news: it's the CTC decoder.

Are those our files, or yours?

Does the official English released model work?

It is created from my data.

Can you verify with our model?

Yes, it works if I use the LM from DeepSpeech. I will re-check my LM, create it again, and verify whether it works.
As of now, the issue is due to an incorrect lm and trie.

Thanks for the prompt help @lissyx. As always, you rock :+1: :+1:

Please have a look at data/lm; it should guide you.
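For reference, a minimal sketch of the usual rebuild steps, assuming KenLM is built and the generate_trie binary from native_client is available (corpus.txt and the exact flag values are illustrative; the real commands and parameters are documented under data/lm):

# Build an ARPA language model from a text corpus with KenLM (corpus.txt is hypothetical)
lmplz --order 5 --text corpus.txt --arpa lm.arpa

# Convert the ARPA file into the binary format DeepSpeech loads
build_binary -a 255 -q 8 trie lm.arpa lm.binary

# Generate the trie against the same alphabet.txt used for training
./generate_trie alphabet.txt lm.binary trie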

I also got the Segmentation fault.

root@4d8a33df56c5:/DeepSpeech# ./evaluate.py --test_files ./data/CV/zh-HK/clips/test.csv
2020-04-27 05:18:21.224252: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-27 05:18:21.229519: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400005000 Hz
2020-04-27 05:18:21.229722: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x554ed80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-27 05:18:21.230140: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-27 05:18:21.231884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-27 05:18:21.231915: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (-1)
2020-04-27 05:18:21.231931: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (4d8a33df56c5): /proc/driver/nvidia/version does not exist
I Loading best validating checkpoint from /root/.local/share/deepspeech/checkpoints/best_dev-135
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on ./data/CV/zh-HK/clips/test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00
Fatal Python error: Segmentation fault

Thread 0x00007eff367fc700 (most recent call first):
File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
File "/usr/lib/python3.6/multiprocessing/pool.py", line 463 in _handle_results
File "/usr/lib/python3.6/threading.py", line 864 in run
File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007eff36ffd700 (most recent call first):
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/util/helpers.py", line 94 in _limit
File "/usr/lib/python3.6/multiprocessing/pool.py", line 290 in _guarded_task_generation
File "/usr/lib/python3.6/multiprocessing/pool.py", line 419 in _handle_tasks
File "/usr/lib/python3.6/threading.py", line 864 in run
File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007eff377fe700 (most recent call first):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 406 in _handle_workers
File "/usr/lib/python3.6/threading.py", line 864 in run
File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007effc2fac740 (most recent call first):
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/swigwrapper.py", line 364 in ctc_beam_search_decoder_batch
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 128 in ctc_beam_search_decoder_batch
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/evaluate.py", line 112 in run_test
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/evaluate.py", line 130 in evaluate
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/evaluate.py", line 143 in main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250 in _run_main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299 in run
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/evaluate.py", line 152 in run_script
File "./evaluate.py", line 12 in <module>
Segmentation fault

Is it the same problem in my case?

I also built my own lm.binary to support zh-HK in UTF-8.
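If it helps diagnose this, I can capture a native backtrace of the crash; a minimal sketch of what I would run inside the same container (the evaluate.py invocation is taken from the log above):

# Run the failing command under gdb to get a native backtrace of the segfault
gdb --args python3 ./evaluate.py --test_files ./data/CV/zh-HK/clips/test.csv
(gdb) run
(gdb) bt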