Hi, I am trying to start the training process on the latest DeepSpeech release, but I get a segmentation fault every time.
My command and flags:
CUDA_VISIBLE_DEVICES=2 python3 -u DeepSpeech.py --train_files ../data/train/train.csv --dev_files ../data/dev/dev.csv --test_files ../data/test/test.csv --train_batch_size 12 --dev_batch_size 12 --test_batch_size 8 --n_hidden 2048 --epochs 50 --dropout_rate 0.27 --learning_rate 0.0001 --export_dir ../data/ru_model/ --checkpoint_dir ../data/checkout/ --alphabet_config_path ../data/alphabetru.txt --utf8=true
GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
+-------------------------------+----------------------+----------------------+
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 52%   61C    P2    77W / 280W |  10457MiB / 11178MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   58C    P0    37W / 151W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   59C    P0    34W / 151W |      0MiB /  8117MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18338      C   python3                                    10447MiB |
+-----------------------------------------------------------------------------+
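To double-check which card the run actually uses: nvidia-smi numbers GPUs in PCI bus order, while the indices in CUDA_VISIBLE_DEVICES follow CUDA's own ordering, which defaults to fastest-first. CUDA_VISIBLE_DEVICES=2 is therefore only guaranteed to mean nvidia-smi's GPU 2 when CUDA_DEVICE_ORDER=PCI_BUS_ID is also set, e.g. CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=2 python3 -u DeepSpeech.py followed by the same flags as above. The sketch below is a hypothetical helper (the function names are mine, not part of DeepSpeech) that parses the CSV output of nvidia-smi --query-gpu=index,pci.bus_id,memory.used,memory.total --format=csv,noheader,nounits and lists GPUs that look idle; the 100 MiB threshold is an arbitrary assumption.

```python
import subprocess
from typing import List, NamedTuple


class GpuInfo(NamedTuple):
    index: int      # nvidia-smi index (PCI bus order)
    bus_id: str
    used_mib: int
    total_mib: int


# Fields requested from nvidia-smi; with "nounits" the memory columns are plain MiB numbers.
QUERY = "index,pci.bus_id,memory.used,memory.total"


def parse_gpu_csv(text: str) -> List[GpuInfo]:
    """Parse `nvidia-smi --query-gpu=QUERY --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        index, bus_id, used, total = (field.strip() for field in line.split(","))
        gpus.append(GpuInfo(int(index), bus_id, int(used), int(total)))
    return gpus


def idle_gpus(gpus: List[GpuInfo], max_used_mib: int = 100) -> List[int]:
    """Return nvidia-smi indices of GPUs using at most `max_used_mib` MiB (arbitrary cutoff)."""
    return [g.index for g in gpus if g.used_mib <= max_used_mib]


def query_nvidia_smi() -> List[GpuInfo]:
    """Run nvidia-smi (must be on PATH) and parse its output; raises if the call fails."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    # Illustrative input built from the memory figures in the table above,
    # so the example runs without a GPU; use query_nvidia_smi() on a real machine.
    sample = (
        "0, 00000000:04:00.0, 10457, 11178\n"
        "1, 00000000:17:00.0, 0, 8119\n"
        "2, 00000000:65:00.0, 0, 8117"
    )
    print(idle_gpus(parse_gpu_csv(sample)))  # [1, 2]
```

Pairing this with CUDA_DEVICE_ORDER=PCI_BUS_ID keeps the index you read here and the index you pass to CUDA_VISIBLE_DEVICES consistent.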
And I get this error:
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Fatal Python error: Segmentation fault

Thread 0x00007fb2ecffd700 (most recent call first):
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/feeding.py", line 107 in to_sparse_tuple
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/feeding.py", line 125 in generate_values

Thread 0x00007fb1d22fd700 (most recent call first):
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 429 in _handle_results
  File "/usr/lib/python3.5/threading.py", line 862 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb1d2afe700 (most recent call first):
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/sample_collections.py", line 304 in __getitem__
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/sample_collections.py", line 311 in __iter__
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/util/helpers.py", line 92 in _limit
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 286 in <genexpr>

Thread 0x00007fb1d32ff700 (most recent call first):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 367 in _handle_workers
  File "/usr/lib/python3.5/threading.py", line 862 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb30a7fc700 (most recent call first):
  File "/usr/lib/python3.5/threading.py", line 293 in wait
  File "/usr/lib/python3.5/queue.py", line 164 in get
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb353fff700 (most recent call first):
  File "/usr/lib/python3.5/threading.py", line 293 in wait
  File "/usr/lib/python3.5/queue.py", line 164 in get
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007fb3f8856700 (most recent call first):
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
  File "/media/a/mark/hello/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 956 in run
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 548 in run_set
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 588 in train
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 911 in main
  File "/media/a/mark/hello/lib/python3.5/site-packages/absl/app.py", line 250 in _run_main
  File "/media/a/mark/hello/lib/python3.5/site-packages/absl/app.py", line 299 in run
  File "/media/a/mark/DeepSpeech/training/deepspeech_training/train.py", line 939 in run_script
  File "DeepSpeech.py", line 12 in <module>
Segmentation fault (core dumped)
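The first thread in the dump is inside to_sparse_tuple, called from generate_values in feeding.py, so judging by the names the crash presumably happens while an encoded transcript is being packed into a sparse label structure. Purely as an illustration (to_sparse_tuple_sketch is a hypothetical stand-in written in plain Python, not DeepSpeech's actual code), this is the (indices, values, shape) layout such a function produces for one label sequence, i.e. the kind of sparse label input TensorFlow's CTC loss works with:

```python
from typing import List, Sequence, Tuple

# (indices, values, shape) for a single-row sparse matrix.
SparseTuple = Tuple[List[Tuple[int, int]], List[int], Tuple[int, int]]


def to_sparse_tuple_sketch(labels: Sequence[int]) -> SparseTuple:
    """Hypothetical stand-in: describe one label sequence as a 1 x N
    sparse matrix given as (indices, values, shape)."""
    n = len(labels)
    # One (row, column) coordinate per label; the row is always 0
    # because the matrix holds a single transcript.
    indices = [(0, i) for i in range(n)]
    values = list(labels)
    shape = (1, n)
    return indices, values, shape


if __name__ == "__main__":
    print(to_sparse_tuple_sketch([5, 12, 3]))
    # ([(0, 0), (0, 1), (0, 2)], [5, 12, 3], (1, 3))
    print(to_sparse_tuple_sketch([]))
    # ([], [], (1, 0)): an empty transcript yields no entries at all
```

The empty case is shown because a zero-length label sequence is an edge case worth ruling out in the CSVs; this sketch does not show that it causes the segfault.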
I have also tried other DeepSpeech versions (such as 0.6.1) but got the same error. Can you please point me in the right direction? Maybe I have the wrong CUDA version or something similar, or a hardware problem.
My dataset is medium-large (around 2,000 hours of Russian speech), and all the WAV files seem to be fine.
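One way to turn "seem to be fine" into something checked is a quick pass over the CSVs before training. The sketch below is hypothetical (the script and function names are mine) and rests on several assumptions: DeepSpeech-style CSV columns wav_filename and transcript, relative WAV paths resolved against the CSV's directory, the default 16 kHz, 16-bit mono PCM input format, and an alphabet file with one symbol per line where lines starting with # are comments and \# stands for a literal #. The alphabet check is only meaningful when training without --utf8; my understanding is that UTF-8 mode derives labels from bytes rather than from the alphabet file.

```python
import csv
import os
import sys
import wave
from typing import Iterator, List, Set, Tuple

# Assumption: DeepSpeech's default input format (16 kHz, 16-bit, mono PCM).
EXPECTED_RATE = 16000
EXPECTED_WIDTH_BYTES = 2
EXPECTED_CHANNELS = 1


def load_alphabet(path: str) -> Set[str]:
    """Read an alphabet file with one symbol per line (assumed format:
    '#' starts a comment, '\\#' stands for a literal '#')."""
    symbols = set()
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\r\n")
            if line.startswith("\\#"):
                symbols.add("#")
            elif line and not line.startswith("#"):
                symbols.add(line)
    return symbols


def check_transcript(transcript: str, alphabet: Set[str]) -> List[str]:
    """Flag empty transcripts and characters missing from the alphabet."""
    problems = []
    if not transcript.strip():
        problems.append("empty transcript")
    unknown = sorted(set(transcript) - alphabet)
    if unknown:
        problems.append("characters not in alphabet: %r" % unknown)
    return problems


def check_wav(path: str) -> List[str]:
    """Flag WAV files that cannot be opened or differ from the expected format."""
    try:
        with wave.open(path, "rb") as w:
            rate, width = w.getframerate(), w.getsampwidth()
            channels, frames = w.getnchannels(), w.getnframes()
    except (wave.Error, EOFError, OSError) as err:
        return ["unreadable WAV: %s" % err]
    problems = []
    if rate != EXPECTED_RATE:
        problems.append("sample rate %d Hz" % rate)
    if width != EXPECTED_WIDTH_BYTES:
        problems.append("sample width %d bytes" % width)
    if channels != EXPECTED_CHANNELS:
        problems.append("%d channels" % channels)
    if frames == 0:
        problems.append("no audio frames")
    return problems


def validate_csv(csv_path: str, alphabet: Set[str]) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (row number, WAV path, problems) for every problematic row."""
    base = os.path.dirname(os.path.abspath(csv_path))
    with open(csv_path, encoding="utf-8", newline="") as f:
        # Row 1 is the header, so data rows start at 2.
        for row_no, row in enumerate(csv.DictReader(f), start=2):
            wav = os.path.join(base, row["wav_filename"])
            problems = check_transcript(row["transcript"], alphabet) + check_wav(wav)
            if problems:
                yield row_no, wav, problems


def main(argv: List[str]) -> None:
    if len(argv) < 3:
        print("usage: %s ALPHABET CSV [CSV ...]" % argv[0])
        return
    alphabet = load_alphabet(argv[1])
    for csv_path in argv[2:]:
        bad = 0
        for row_no, wav, problems in validate_csv(csv_path, alphabet):
            bad += 1
            print("%s:%d %s: %s" % (csv_path, row_no, wav, "; ".join(problems)))
        print("%s: %d problematic rows" % (csv_path, bad))


if __name__ == "__main__":
    main(sys.argv)
```

With the paths from the command above, it would be run as python3 check_dataset.py ../data/alphabetru.txt ../data/train/train.csv ../data/dev/dev.csv ../data/test/test.csv. On 2,000 hours of audio this reads every WAV header, so expect it to take a while.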