Unable to start training process: Segmentation fault (core dumped) D:

Hi, I am trying to start the training process on the latest DeepSpeech release, but I get a segmentation fault every time.

My flags:

CUDA_VISIBLE_DEVICES=2 python3 -u DeepSpeech.py --train_files ../data/train/train.csv  --dev_files ../data/dev/dev.csv --test_files ../data/test/test.csv --train_batch_size 12 --dev_batch_size 12 --test_batch_size 8 --n_hidden 2048 --epochs 50 --dropout_rate 0.27 --learning_rate 0.0001 --export_dir ../data/ru_model/ --checkpoint_dir ../data/checkout/ --alphabet_config_path ../data/alphabetru.txt --utf8=true

GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 52%   61C    P2    77W / 280W |  10457MiB / 11178MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   58C    P0    37W / 151W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   59C    P0    34W / 151W |      0MiB /  8117MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18338      C   python3                                    10447MiB |

And I get this error:

Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Fatal Python error: Segmentation fault

Thread 0x00007fb2ecffd700 (most recent call first):
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/feeding.py", line 107 in to_sparse_tuple
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/feeding.py", line 125 in generate_values

Thread 0x00007fb1d22fd700 (most recent call first):
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 429 in _handle_results
  File "/usr/lib/python3.5/threading.py", line 862 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb1d2afe700 (most recent call first):
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/sample_collections.py", line 304 in __getitem__
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/sample_collections.py", line 311 in __iter__
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/helpers.py", line 92 in _limit
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 286 in <genexpr>

Thread 0x00007fb1d32ff700 (most recent call first):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 367 in _handle_workers
  File "/usr/lib/python3.5/threading.py", line 862 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb30a7fc700 (most recent call first):
  File "/usr/lib/python3.5/threading.py", line 293 in wait
  File "/usr/lib/python3.5/queue.py", line 164 in get
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb353fff700 (most recent call first):
  File "/usr/lib/python3.5/threading.py", line 293 in wait
  File "/usr/lib/python3.5/queue.py", line 164 in get
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb3f8856700 (most recent call first):
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 956 in run
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 548 in run_set
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 588 in train
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 911 in main
  File "/media/a/mark/hello/lib/python3.5/site-packages/absl/app.py", line 250 in _run_main
  File "/media/a/mark/hello/lib/python3.5/site-packages/absl/app.py", line 299 in run
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 939 in run_script
  File "DeepSpeech.py", line 12 in <module>
Segmentation fault (core dumped)

Also, I've tried different DeepSpeech versions (such as 0.6.1) but got the same errors. Can you please point me in the right direction (maybe I have the wrong CUDA version, etc., or a hardware problem)?
My dataset is medium-large (around 2000 hours), and all the WAV files seem to be fine (Russian language).

pip3 list
absl-py 0.9.0
alembic 1.4.2
astor 0.8.1
attrdict 2.0.1
audioread 2.1.8
beautifulsoup4 4.9.0
bs4 0.0.1
certifi 2020.4.5.1
cffi 1.14.0
chardet 3.0.4
cliff 3.1.0
cmaes 0.4.0
cmd2 0.8.9
colorlog 4.1.0
decorator 4.4.2
deepspeech-training 0.7.0a3 /media/a/mark/DeepSpeech/training
ds-ctcdecoder 0.7.0a3
gast 0.2.2
google-pasta 0.2.0
grpcio 1.28.1
h5py 2.10.0
idna 2.9
joblib 0.14.1
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.0
librosa 0.7.2
llvmlite 0.31.0
Mako 1.1.2
Markdown 3.2.1
MarkupSafe 1.1.1
numba 0.47.0
numpy 1.18.1
opt-einsum 3.2.0
optuna 1.3.0
opuslib 2.0.0
pandas 0.24.2
pbr 5.4.5
pip 20.0.2
pkg-resources 0.0.0
prettytable 0.7.2
progress 1.5
progressbar2 3.50.1
protobuf 3.11.3
pycparser 2.20
pyparsing 2.4.7
pyperclip 1.8.0
python-dateutil 2.8.1
python-editor 1.0.4
python-utils 2.4.0
pytz 2019.3
pyxdg 0.26
PyYAML 5.3.1
requests 2.23.0
resampy 0.2.2
scikit-learn 0.22.2.post1
scipy 1.4.1
semver 2.9.1
setuptools 46.1.3
six 1.14.0
SoundFile 0.10.3.post1
soupsieve 2.0
sox 1.3.7
SQLAlchemy 1.3.16
stevedore 1.32.0
tensorboard 1.15.0
tensorflow-estimator 1.15.1
tensorflow-gpu 1.15.2
termcolor 1.1.0
tqdm 4.45.0
urllib3 1.25.8
wcwidth 0.1.9
webrtcvad 2.0.10
Werkzeug 1.0.1
wheel 0.34.2
wrapt 1.12.1

I downloaded DeepSpeech 0.7.x with git-lfs.
Also, this guide doesn't cover the need for git-lfs: https://github.com/mozilla/DeepSpeech/blob/master/doc/TRAINING.rst#training-your-own-model
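For reference, the clone itself was roughly this (just a sketch; the apt line assumes you can install packages on the machine):

sudo apt-get install git-lfs                          # one-time setup
git lfs install                                       # enable LFS for the current user
git clone https://github.com/mozilla/DeepSpeech.git   # LFS-tracked files are fetched during the clone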

It’s mentioned twice in the first 15 lines.

Nvm, I am blind then, sorry.

Without more details it's hard to know what's causing the segmentation fault. Can you run it under a debugger and get a stack trace from the crashing thread?
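Something along these lines should work (just a sketch, assuming gdb is installed on the machine; keep your usual training flags):

gdb --args python3 -u DeepSpeech.py <your usual training flags>
(gdb) run
# after the crash:
(gdb) bt full
(gdb) thread apply all bt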

Alright, I will do it. Thanks for the reply.

gdb shows this (it may not be useful, but I'll send it here anyway):

0x00007fff996fbf74 in tensorflow::Status tensorflow::ctc::CTCLossCalculator<float>::CalculateLoss<Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> >, Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >(Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> >, std::allocator<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> > > > const&, bool, bool, bool, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>*, std::vector<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, std::allocator<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > > >*, tensorflow::DeviceBase::CpuWorkerThreads*) const ()

(gdb) bt
#0  0x00007fff996fbf74 in tensorflow::Status tensorflow::ctc::CTCLossCalculator<float>::CalculateLoss<Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> >, Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >(Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> >, std::allocator<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> > > > const&, bool, bool, bool, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>*, std::vector<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, std::allocator<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > > >*, tensorflow::DeviceBase::CpuWorkerThreads*) const () from /media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#1  0x00007fff996fd793 in tensorflow::CTCLossOp<float>::Compute(tensorflow::OpKernelContext*) () from /media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#2  0x00007fff94fe804c in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
   from /media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#3  0x00007fff94fe837f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#4  0x00007fff95098261 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#5  0x00007fff95095958 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#6  0x00007fff956f2d5f in std::execute_native_thread_routine (__p=0x6867590) at /dt7-src/libstdc++-v3/src/nonshared11/../c++11/thread.cc:83
#7  0x00007ffff7bc16ba in start_thread (arg=0x7ffe76ffd700) at pthread_create.c:333
#8  0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

catchsegv output:
Segmentation fault (core dumped)

Segmentation fault
Register dump:

 RAX: 0000000000000000   RBX: 0000000000000000   RCX: 00007f2ea988f269
 RDX: 000000000000000b   RSI: 0000000000001528   RDI: 00000000000014e5
 RBP: 00007f2d44f38de0   R8 : 000000000000000a   R9 : 0000000000000002
 R10: 00000000000000a5   R11: 0000000000000202   R12: 00007f2d44f38ec0
 R13: 00007f2d44f38f70   R14: 00007f2d44f38f70   R15: 00007f2d44f38f50
 RSP: 00007f2d44f37fb8

 RIP: 00007f2ea988f269   EFLAGS: 00000202

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000004   OldMask: 00000000   CR2: 00000008

 FPUCW: 0000037f   FPUSW: 00000000   TAG: 00000000
 RIP: 00000000   RDP: 00000000

 ST(0) 0000 0000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) 0000 0000000000000000
 ST(4) 0000 0000000000000000   ST(5) 0000 0000000000000000
 ST(6) 0000 0000000000000000   ST(7) 0000 0000000000000000
 mxcsr: 1f80
 XMM0:  00000000000000000000000000000000 XMM1:  00000000000000000000000000000000
 XMM2:  00000000000000000000000000000000 XMM3:  00000000000000000000000000000000
 XMM4:  00000000000000000000000000000000 XMM5:  00000000000000000000000000000000
 XMM6:  00000000000000000000000000000000 XMM7:  00000000000000000000000000000000
 XMM8:  00000000000000000000000000000000 XMM9:  00000000000000000000000000000000
 XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
 XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
 XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000

Backtrace:
/lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29)[0x7f2ea988f269]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f2ea988f390]
/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZNK10tensorflow3ctc17CTCLossCalculatorIfE13CalculateLossIN5Eigen9TensorMapINS4_6TensorIKiLi1ELi1ElEELi16ENS4_11MakePointerEEENS5_INS6_IfLi1ELi1ElEELi16ES9_EENS4_3MapIKNS4_6MatrixIfLin1ELin1ELi1ELin1ELin1EEELi0ENS4_6StrideILi0ELi0EEEEENSD_ISF_Li0ESI_EEEENS_6StatusERKT_RKSt6vectorISP_IiSaIiEESaISR_EERKSP_IT1_SaISW_EEbbbPT0_PSP_IT2_SaIS13_EEPNS_10DeviceBase16CpuWorkerThreadsE+0x84)[0x7f2e4b3bff74]
/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow9CTCLossOpIfE7ComputeEPNS_15OpKernelContextE+0x11c3)[0x7f2e4b3c1793]
/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf7c04c)[0x7f2e46cac04c]
/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf7c37f)[0x7f2e46cac37f]
/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f2e46d5c261]
/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f2e46d59958]
/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x1686d5f)[0x7f2e473b6d5f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f2ea98856ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f2ea95bb41d]

If this info is useless, please advise me on further debugging steps.

I'd suspect a broken virtualenv; it has happened to me several times: an old venv, several system upgrades, and this kind of breakage.

Yeah, I've already tried different envs, including Anaconda envs, virtualenv, and python3 -m venv. Same error. I'm going to try updating the venv.

Also, I don't have sudo access on the remote server. Can that affect the process?

It should not

Please make sure you set up a pure Python 3 venv from scratch.


Hm, are you sure you know what you are doing here?

  1. The utf8 flag doesn't affect the error, and my transcripts are in UTF-8.
     Maybe my alphabet file has a different encoding; I will check it.

  2. Well, the new env crashes while installing requirements (the llvmlite package).
     I haven't touched anything in the DeepSpeech folder.

  1. mark@smedx-server:/media/a/mark$ python3 -m venv dpenv          # new venv
  2. mark@smedx-server:/media/a/mark$ source dpenv/bin/activate      # activate it
  3. (dpenv) mark@smedx-server:/media/a/mark$ cd DeepSpeech          # go to the folder
     Then follow the guide:
  4. (dpenv) mark@smedx-server:/media/a/mark/DeepSpeech$ pip3 install pip==20.0.2 wheel==0.34.2 setuptools==46.1.3
  5. pip3 install --upgrade --force-reinstall -e .

It wasn't a problem in my previous env. Do I have to install llvmlite?
ERROR: Failed building wheel for llvmlite

RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config
ERROR: Command errored out with exit status 1: /media/a/mark/dpenv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-bpn9muh0/llvmlite/setup.py'"'"'; __file__='"'"'/tmp/pip-install-bpn9muh0/llvmlite/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-apbeuwq8/install-record.txt --single-version-externally-managed --compile --install-headers /media/a/mark/dpenv/include/site/python3.5/llvmlite Check the logs for full command output.

Traceback (most recent call last):
  File "/tmp/pip-install-bpn9muh0/llvmlite/ffi/build.py", line 168, in <module>
    main()
  File "/tmp/pip-install-bpn9muh0/llvmlite/ffi/build.py", line 158, in main
    main_posix('linux', '.so')
  File "/tmp/pip-install-bpn9muh0/llvmlite/ffi/build.py", line 109, in main_posix
    "to the path for llvm-config" % (llvm_config,))
RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config
error: command '/media/a/mark/dpenv/bin/python3' failed with exit status 1

I also tried to install this package separately:
pip3 install llvmlite
and got the same errors as above.
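I guess the intended fix is to point LLVM_CONFIG at an existing llvm-config, something like this (just a guess; it assumes an LLVM toolchain, e.g. version 8, is already installed somewhere, and the path below is only an example):

export LLVM_CONFIG=/usr/lib/llvm-8/bin/llvm-config   # tell the llvmlite build where llvm-config lives
pip3 install llvmlite==0.31.0                        # version taken from my old working env above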

Anyway, I will re-clone the project and build it one more time.

The --utf8 flag is not related to the encoding of the transcripts or the alphabet (those should always be UTF-8 encoded anyway); it has to do with the target alphabet of the model. With --utf8, the model predicts UTF-8 bytes directly instead of the discrete characters defined in the alphabet file, and the alphabet file is ignored completely. There are some docs here: https://deepspeech.readthedocs.io/en/master/Decoder.html#decoder-docs
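In other words (purely illustrative, reusing the flags from your command; "..." stands for the rest of your flags):

# alphabet mode: the model predicts the characters listed in the alphabet file
python3 DeepSpeech.py ... --alphabet_config_path ../data/alphabetru.txt
# UTF-8 byte mode: the model predicts raw UTF-8 bytes and the alphabet file is ignored
python3 DeepSpeech.py ... --utf8=true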