Python error: Segmentation fault when training

Hi,

My ds-ctcdecoder version is 0.6.0a15
I am running Debian GNU/Linux 9.11 (stretch) x86_64
I am using DeepSpeech version 0.6.0-alpha.15

I’ve been trying to train DeepSpeech on my own dataset.
Most of my work is available here.
It has the train/test CSV splits, alphabet, vocab, trie, etc.
I generated the alphabet and vocab from data/tarteel/quran.json using bin/generate_(alphabet|vocabulary).py.

The binary and arpa files were generated using:

lmplz --order 5 --text vocabulary.txt --arpa words.arpa
build_binary -q 8 trie words.arpa lm.binary
../native_client/generate_trie alphabet.txt lm.binary quran.trie

native_client was compiled in a different directory than the one I was working in, but with the same DeepSpeech version.

When running the following command:

python3 -u "$PATH_TO_DEEPSPEECHPY" \
  --log_dir "$LOG_DIR" \
  --summary_dir "$SUMMARY_DIR" \
  --alphabet_config_path "$ALPHABET_PATH" \
  --checkpoint_dir "$CHECKPOINT_DIR" \
  --train_files "$TRAIN_CSV_FILE" \
  --dev_files "$DEV_CSV_FILE" \
  --test_files "$TEST_CSV_FILE" \
  --export_dir "$EXPORT_DIR" \
  --lm_binary_path "$LM_BINARY_PATH" \
  --lm_trie_path "$LM_TRIE_PATH" \
  --lm_alpha 1.5 \
  --dropout_rate 0.30 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 2048 \
  --epochs 35 \
  --early_stop true \
  --es_steps 6 \
  --es_mean_th 0.1 \
  --es_std_th 0.1 \
  --learning_rate 0.00095 \
  "$@"

I get the following stack trace:

I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                    Fatal Python error: Segmentation fault

Thread 0x00007f84dda86700 (most recent call first):
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/pandas/core/dtypes/common.py", line 1789 in is_extension_array_dtype
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/pandas/core/internals/blocks.py", line 3255 in get_block_type
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/pandas/core/internals/blocks.py", line 3284 in make_block
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1518 in __init__
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/pandas/core/series.py", line 321 in __init__
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/pandas/core/frame.py", line 909 in iterrows
  File "/home/anas/tarteel_ws/ml-infra/models/deepspeech/util/feeding.py", line 104 in generate_values
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 547 in generator_py_func
  File "/home/anas/.virtualenvs/ds-env/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 217 in __call__

Thread 0xFatal Python error: Bus error

./train_deepspeech.sh: line 97: 16523 Bus error

Any suggestions on why this is happening?

No, a segfault with a “Bus error” is weird; I don’t know what we can do there.

Please reproduce with the latest 0.6.0 and share more details on your hardware. Stretch should work, since this is built on Ubuntu 14.04.

I am training on a Google Cloud VM (Tesla T4, compute capability 7.5) with CUDA 10.0 and cuDNN v7.

Edit:

Using a MacBook Pro (macOS Catalina 10.15.1) with an Intel CPU, Python 3.7, and the DeepSpeech 0.6.0 release, running the same script as above, I get the following stack trace. DeepSpeech was compiled and the trie, binary, and arpa files were regenerated accordingly:

Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                   Fatal Python error: Segmentation fault

Thread 0x000070000f15c000 (most recent call first):
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/queue.py", line 170 in get
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x000070000ec59000 (most recent call first):
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 296 in wait
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/queue.py", line 170 in get
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x0000000112accdc0 (most recent call first):
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429 in _call_tf_sessionrun
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341 in _run_fn
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356 in _do_call
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350 in _do_run
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173 in _run
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950 in run
  File "DeepSpeech.py", line 599 in run_set
  File "DeepSpeech.py", line 631 in train
  File "DeepSpeech.py", line 938 in main
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/absl/app.py", line 250 in _run_main
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/absl/app.py", line 299 in run
  File "DeepSpeech.py", line 965 in <module>
./train_deepspeech.sh: line 40: 92341 Segmentation fault: 11

Why do you need to rebuild? Can you please rely on the pre-built binaries? Can you verify and share your setup steps?

It’s hard to debug if you don’t share the proper information…

Sorry, to clarify: I built TensorFlow to generate the generate_trie executable. I did not rebuild DeepSpeech from scratch.

My steps:

  1. Prepare CSV files, vocab, and alphabet
  2. Generate arpa file
  3. Generate lm.binary
  4. Generate trie file
  5. Run the training script using DeepSpeech.py

The specific instructions I followed are in the repo README. I used KenLM to generate the files.

FYI, I also used the util.audio.convert_audio function to convert the audio into a format compatible with DeepSpeech.

Why?

Can you share the actual command line you are using?

Can you also share pip list | grep ctcdecoder?

Can you use bin/run-ldc93s1.sh to verify the basics of your setup?

  1. Running bin/run-ldc93s1.sh results in a successful training run.
  2. My ds-ctcdecoder version is 0.6.0
  3. I compiled TensorFlow so that I could use the generate_trie executable and generate my own trie file. Is there another way I am supposed to generate it for my custom alphabet/vocab?

I am using the following commands for generating the files:

# Generate .arpa file
lmplz --text data/tarteel/vocabulary.txt --arpa  data/tarteel/words.arpa --o 4
# Generate lm.binary
build_binary -q 16 -b 7 trie data/tarteel/words.arpa data/tarteel/lm.binary
# Generate Trie
native_client/generate_trie data/tarteel/alphabet.txt \
                            data/tarteel/lm.binary \
                            data/tarteel/vocabulary.txt quran.trie

I simplified my training script to be as close as possible to run-ldc93s1.sh:

python3 -u DeepSpeech.py \
  --alphabet_config_path "$ALPHABET_PATH" \
  --lm_binary_path "$LM_BINARY_PATH" \
  --lm_trie_path "$LM_TRIE_PATH" \
  --train_files "$TRAIN_CSV_FILE" \
  --test_files "$TEST_CSV_FILE" \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 35 \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"

I still get a segfault unfortunately…

generate_trie is bundled in the release native_client.*.tar.xz files, so you should not need to rebuild it.

So at least your setup is fine.

At that point, it means there’s something wrong with your specific data. We don’t even have a proper stack here, so we can hardly see where it comes from. And it does not match your first report. It’s hard to be 100% sure we are facing the same issue here.

Can we please get more information on those files?

Just to be sure, can you triple-check those path values? Are they relative or absolute?

Also, how are those built? What is inside? Can you make a single-example reduced test-case?

Maybe you should also run python under gdb to get more context on the crash.
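
Something like this (your existing training command, just run under gdb; after it crashes, bt should show the native backtrace):

gdb --args python3 -u DeepSpeech.py --train_files "$TRAIN_CSV_FILE" ...
(gdb) run
(gdb) bt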

At that point, it means there’s something wrong with your specific data.

I suspect this may be the case after this troubleshooting. I am writing a script right now that runs through all the audio files and checks that they can be read by the Python wave module.
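
Roughly along these lines (a minimal sketch, not the actual gist; RECORDINGS_DIR is a placeholder for my recordings folder):

import glob
import os
import wave

# Placeholder: directory containing the WAV files referenced by the CSVs
RECORDINGS_DIR = "data/tarteel/recordings"

wav_paths = glob.glob(os.path.join(RECORDINGS_DIR, "*.wav"))
bad_files = []
for path in wav_paths:
    try:
        with wave.open(path, "rb") as wav:
            wav.getnframes()  # forces the RIFF/fmt header to be parsed
    except (wave.Error, EOFError) as err:
        bad_files.append(path)
        print("{}: {}".format(path, err))

print("{} unreadable files out of {}".format(len(bad_files), len(wav_paths)))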

We don’t even have a proper stack here, so we can hardly see where it comes from.

It’s the same stack trace I posted before, only failing somewhere else (probably because of threading):

I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                   Fatal Python error: Segmentation fault

Thread 0x0000700007cf7000 (most recent call first):
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/numpy/core/numeric.py", line 501 in asarray
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 178 in _convert
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 217 in <listcomp>
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 209 in __call__

Thread 0x./bin/train_deepspeech.sh: line 36:  7505 Segmentation fault: 11

Can we please get more information on those files?

Like I said before, they are in the repo I previously shared.

Just to be sure, can you triple-check those path values? Are they relative or absolute?

They are absolute. I even check that they exist in my script with this function :slight_smile::

# Verify a required file exists before kicking off training.
check_file(){
  if [[ -f "$1" ]]; then
    echo "Found $1"
  else
    echo "Missing $1" >&2
    exit 1
  fi
}

Also, how are those built?

With the generate_csv.py file in our repo.

What is inside?

Example:

wav_filename,wav_filesize,transcript
/Users/allabana/develop/deepspeech/data/tarteel/recordings/15_81_3568321073.wav,1114156,و ا ت ي ن ا ه م _ ا ي ا ت ن ا _ ف ك ا ن و ا _ ع ن ه ا _ م ع ر ض ي ن 

Can you make a single-example reduced test-case?

Yes, will get back to you on that.

Maybe you should also run python under gdb to get more context on the crash.

Good call.

Using this gist, I have found some files that could not be read by the wave module.

Most of them had the following error, so it looks like I need to fix these files:

file does not start with RIFF id

Will update on single file test.

Pick one that can be properly read with the gist you shared, and verify. If it works, that’s likely the reason. This would also explain the first stack (more detailed than the second one), where the failure is within the feeding infra, and also why you get the crash at a different point on different OSes.

I’m sorry, but I can’t just explore your fork to deduce and help you; that’s way too much work. Having it is nice, but I really need you to point me to things like you did.

At least, the LM files seem legit (I was wondering if they might have been improperly generated and be just a few bytes/kbytes).

I’m sorry, but I can’t just explore your fork to deduce and help you

That’s valid. The TL;DR of that script (sketched after this list) is:

  • Get all the audio files in a directory and iterate through them.
  • Calculate the audio file size.
  • Parse the first two numbers in the file name: <Chapter>_<verse>_<hash>.wav.
  • Get the transcript of the Chapter/Verse.
  • Create the CSV entry.
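
In sketch form (this is not the actual generate_csv.py, just the idea; get_transcript() is a placeholder for the quran.json lookup and transcript formatting):

import csv
import glob
import os

RECORDINGS_DIR = "data/tarteel/recordings"  # placeholder path

def get_transcript(chapter, verse):
    # Placeholder: the real script reads data/tarteel/quran.json and formats
    # the verse text to match the alphabet (space-separated characters,
    # '_' between words, as in the CSV example above).
    return ""

with open("train.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for path in glob.glob(os.path.join(RECORDINGS_DIR, "*.wav")):
        # File names look like <Chapter>_<verse>_<hash>.wav
        chapter, verse, _ = os.path.basename(path).split("_", 2)
        writer.writerow([os.path.abspath(path),
                         os.path.getsize(path),
                         get_transcript(int(chapter), int(verse))])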

I tested out a single bad file and got the following error:

I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                   Traceback (most recent call last):
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Header mismatch: Expected fmt  but found JUNK
	 [[{{node DecodeWav}}]]
	 [[tower_0/IteratorGetNext]]

Which led me to a previous solution by you here :smiley:

That and the stack of the graph would suggest a bogus WAV file.

I was able to run DeepSpeech successfully with a single file.

python3 -u DeepSpeech.py \
  --alphabet_config_path "$ALPHABET_PATH" \
  --lm_binary_path "$LM_BINARY_PATH" \
  --lm_trie_path "$LM_TRIE_PATH" \
  --train_files "$TRAIN_CSV_FILE" \
  --test_files "$TEST_CSV_FILE" \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 35 \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"

However, I get the following error when it finishes:

[scorer.cpp:77] FATAL: "(access(filename, (1<<2))) == (0)" check failed. Invalid language model path

and I’m assuming that’s because I need to provide the binary and trie path.

That’s about right. It looks like you are on the path to a solution :slight_smile:

Thanks @lissyx for your help! (Especially during the holidays)

I was wondering if you have any thoughts on whether there should be some pre-processing in DeepSpeech to make sure training files can’t cause a segfault before training starts?

If so, point me in the right direction; I don’t mind contributing back to the original repo.

It seems that some files still cause DeepSpeech to crash even though I checked them with Python’s wave module and converted them using ffmpeg.
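
For reference, the kind of ffmpeg conversion I mean (file names are placeholders; targeting the 16 kHz mono 16-bit PCM WAV format DeepSpeech expects):

ffmpeg -i input.wav -ar 16000 -ac 1 -acodec pcm_s16le output.wav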

No, unless there’s a bug, of course. Not in DeepSpeech.py or util/feeding.py. Maybe a side tool?

We’ve got reports at the TensorFlow level of issues dealing with some WAV files, but we don’t know exactly what’s wrong.

I’ve been doing some debugging and it looks like there is something breaking between v0.5.0 and v0.6.0.

I was able to train successfully on the utf8-ctc-v2 branch and the v0.5.0 release; however, I was unable to train on anything after that (i.e. >v0.5.1).

I tried passing the --utf8 true flag, but that didn’t resolve anything.

Based on the segfault stack trace (which unfortunately hasn’t been very descriptive), it looks like something is failing in pandas when certain operations are applied to the transcript frame (most likely caused by the Arabic text). My hunch is that this is coming from util.feeding.create_dataset:

...
# Convert to character index arrays
df = df.apply(partial(text_to_char_array, alphabet=Config.alphabet), result_type='broadcast', axis=1)
...
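
A reduced repro I plan to try (just a sketch: encode_transcript below is only a stand-in for text_to_char_array/Config.alphabet, and the CSV path is a placeholder for my train file; the point is to run the same kind of apply over the Arabic transcripts outside of the TensorFlow feeding code):

from functools import partial

import numpy as np
import pandas as pd

# Read the alphabet (one character per line, '#' lines are comments).
with open("data/tarteel/alphabet.txt", encoding="utf-8") as f:
    alphabet = [line.rstrip("\n") for line in f if not line.startswith("#")]

def encode_transcript(row, alphabet):
    # Stand-in for text_to_char_array: map each transcript character to its
    # alphabet index (-1 if unknown), mirroring what create_dataset does.
    row['transcript'] = np.asarray([alphabet.index(c) if c in alphabet else -1
                                    for c in row['transcript']])
    return row

df = pd.read_csv("data/tarteel/train.csv")  # placeholder for $TRAIN_CSV_FILE
df = df.apply(partial(encode_transcript, alphabet=alphabet),
              result_type='broadcast', axis=1)
print(df.head())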

If I find anything, I’ll open up an issue on Github and report here.

I’ll post my training results soon to help others train on Arabic text :slight_smile:

Edit:

Note to anyone in the future: -X faulthandler is a helpful Python flag for debugging!
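
For example (rest of the training flags elided):

python3 -X faulthandler -u DeepSpeech.py --train_files "$TRAIN_CSV_FILE" ...

It makes Python print the per-thread stack traces when the interpreter crashes.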

Could you isolate one sentence that reproduces the issue, and try to find out whether we can track it down to one character?

UTF-8 being broken is really not a good thing.

Also, could you try running with LC_ALL=C?
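
i.e. prefix your training command, something like:

LC_ALL=C python3 -u DeepSpeech.py --train_files "$TRAIN_CSV_FILE" ...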