Python error: Segmentation fault when training

At that point, it means there’s something wrong with your specific data.

I suspect this may be the case after this troubleshooting. I am writing a script right now that runs through all the audio files and checks that they can be read by the Python wave module.
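For reference, a minimal sketch of that kind of check, assuming plain *.wav files under a single directory tree (the stdlib wave module raises wave.Error for headers it cannot parse):

# Sketch: try to open every WAV under a directory with the stdlib wave
# module and report the files it cannot read.
import sys
import wave
from pathlib import Path

def find_unreadable_wavs(root):
    bad = []
    for path in Path(root).rglob('*.wav'):
        try:
            with wave.open(str(path), 'rb') as w:
                w.getparams()  # header is parsed on open; this just makes it explicit
        except (wave.Error, EOFError) as err:
            bad.append((path, err))
    return bad

if __name__ == '__main__':
    for path, err in find_unreadable_wavs(sys.argv[1]):
        print(f'{path}: {err}')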

We don’t even have a proper stack here, so we can hardly see where it comes from.

It’s the same stack trace I posted before, only failing somewhere else (probably because of threading):

I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Fatal Python error: Segmentation fault

Thread 0x0000700007cf7000 (most recent call first):
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/numpy/core/numeric.py", line 501 in asarray
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 178 in _convert
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 217 in <listcomp>
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 209 in __call__

Thread 0x./bin/train_deepspeech.sh: line 36:  7505 Segmentation fault: 11

Can we please get more information on those files?

Like I said before, they are in the repo I previously shared.

Just to be sure, can you triple-check those path values? Are they relative or absolute?

They are absolute. I even check that they exist in my script with this function :slight_smile::

check_file(){
  if [[ -f "$1" ]]; then
    echo "Found $1"
  else
    # Flag missing paths instead of silently skipping them.
    echo "Missing $1" >&2
  fi
}

Also, how are those built?

With the generate_csv.py file in our repo.

What is inside?

Example:

wav_filename,wav_filesize,transcript
/Users/allabana/develop/deepspeech/data/tarteel/recordings/15_81_3568321073.wav,1114156,و ا ت ي ن ا ه م _ ا ي ا ت ن ا _ ف ك ا ن و ا _ ع ن ه ا _ م ع ر ض ي ن 

Can you make a single-example reduced test-case?

Yes, will get back to you on that.

Maybe you should also run python under gdb to get more context on the crash.

Good call.

I have found some files that could not be read by wave using this gist.

Most of them had the following error, so it looks like I need to fix these files:

file does not start with RIFF id
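For what it’s worth, one way to repair files like that is to re-encode them. A sketch, assuming ffmpeg is installed and on PATH, targeting the 16 kHz mono 16-bit PCM format DeepSpeech expects:

# Sketch: rewrite a broken WAV as 16 kHz / mono / 16-bit PCM via ffmpeg.
import subprocess

def reencode_wav(src, dst):
    subprocess.run(
        ['ffmpeg', '-y', '-i', src,
         '-ar', '16000', '-ac', '1', '-c:a', 'pcm_s16le', dst],
        check=True,
    )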

Will update on single file test.

Pick one that can be properly read with the gist you shared, and verify. If that works, the unreadable files are likely the reason. This would also explain the first stack (more detailed than the second one), where the failure is within the feeding infra, and also why you get the crash at a different time on different OSes.

I’m sorry, but I can’t just explore your fork to deduce and help you; that’s way too much work. Having it is nice, but I really need you to point me at things like you did.

At least, the LM files seem legit (I was wondering if they might have been improperly generated and be just a few bytes/kbytes).

I’m sorry, but I can’t just explore your fork to deduce and help you

That’s valid. The TL;DR of that script is (a rough sketch follows the list):

  • Get all the audio files in a directory and iterate through them.
  • Calculate each audio file’s size.
  • Parse the first two numbers in the file name: <Chapter>_<verse>_<hash>.wav.
  • Get the transcript for that chapter/verse.
  • Create a CSV entry.
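Roughly, a sketch of that flow (not the actual script; lookup_transcript stands in for however the chapter/verse text is fetched in the repo):

# Sketch of the generate_csv.py flow described above (not the real script).
import csv
import os
from glob import glob

def generate_csv(recordings_dir, out_csv, lookup_transcript):
    with open(out_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
        for wav_path in sorted(glob(os.path.join(recordings_dir, '*.wav'))):
            stem = os.path.splitext(os.path.basename(wav_path))[0]
            chapter, verse, _hash = stem.split('_', 2)  # <Chapter>_<verse>_<hash>
            writer.writerow([
                os.path.abspath(wav_path),
                os.path.getsize(wav_path),
                lookup_transcript(int(chapter), int(verse)),
            ])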

I tested out a single bad file and got the following error:

I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Traceback (most recent call last):
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/allabana/.virtualenvs/test-ds-1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Header mismatch: Expected fmt  but found JUNK
	 [[{{node DecodeWav}}]]
	 [[tower_0/IteratorGetNext]]

That led me to a previous solution of yours here :smiley:

That and the stack of the graph would suggest a bogus WAV file.

I was able to run DeepSpeech successfully with a single file.

python3 -u DeepSpeech.py \
  --alphabet_config_path "$ALPHABET_PATH" \
  --lm_binary_path "$LM_BINARY_PATH" \
  --lm_trie_path "$LM_TRIE_PATH" \
  --train_files "$TRAIN_CSV_FILE" \
  --test_files "$TEST_CSV_FILE" \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 35 \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"

However, I get the following error when it finishes:

[scorer.cpp:77] FATAL: "(access(filename, (1<<2))) == (0)" check failed. Invalid language model path

and I’m assuming that’s because I need to provide the binary and trie path.

That’s about right. It looks like you are on the path to a solution :slight_smile:


Thanks @lissyx for your help! (Especially during the holidays)

I was wondering if you have any thoughts on whether DeepSpeech should do some pre-processing somewhere to make sure training files don’t cause a segfault before training starts?

If so, point me in the right direction and I don’t mind contributing back to the original repo.

It seems that some files still cause DeepSpeech to crash even though I checked them with Python’s wave module/converted them using ffmpeg.

No, unless there’s a bug of course. Not in DeepSpeech.py nor in util/feeding.py. Maybe in a side tool?

We’ve got reports at the TensorFlow level of problems dealing with some WAV files, but we don’t know exactly what’s wrong.

I’ve been doing some debugging and it looks like there is something breaking between v0.5.0 and v0.6.0.

I was able to train successfully on the utf8-ctc-v2 branch and the v0.5.0 release; however, I was unable to train on anything after that (i.e. >v0.5.1).

I tried passing the --utf8 true flag but that doesn’t resolve anything.

Based on the segfault stack trace (which unfortunately hasn’t been very descriptive), it looks like something is failing in pandas when certain operations are applied to the transcript frame (most likely triggered by the Arabic text). My hunch is that this is coming from util.feeding.create_dataset:

...
 # Convert to character index arrays
 df = df.apply(partial(text_to_char_array, alphabet=Config.alphabet), result_type='broadcast', axis=1)
...
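A quick way to hunt for an offending row without going through the full pipeline is to check every transcript character against the alphabet file. A sketch, assuming a DeepSpeech-style alphabet.txt (one character per line, lines starting with '#' treated as comments):

# Sketch: find CSV rows whose transcript contains characters missing from
# the alphabet file.
import csv

def load_alphabet(path):
    with open(path, encoding='utf-8') as f:
        return {line.rstrip('\n') for line in f if not line.startswith('#')}

def find_bad_rows(csv_path, alphabet_path):
    alphabet = load_alphabet(alphabet_path)
    with open(csv_path, encoding='utf-8') as f:
        for i, row in enumerate(csv.DictReader(f)):
            missing = set(row['transcript']) - alphabet
            if missing:
                yield i, row['wav_filename'], missing

for i, wav, missing in find_bad_rows('train.csv', 'alphabet.txt'):
    print(i, wav, missing)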

If I find anything, I’ll open up an issue on Github and report here.

I’ll post my training results soon to help others train on Arabic text :slight_smile:

Edit:

Note to anyone in the future: -X faulthandler is a helpful Python flag for debugging!
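The same behaviour can also be enabled from inside a script with the standard faulthandler module, which dumps a Python-level traceback for every thread when the process receives a fatal signal:

# Equivalent to running with `python -X faulthandler`.
import faulthandler
faulthandler.enable()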

Could you isolate one sentence that reproduces the issue, and try to find out if we can track that down to one character?

UTF-8 being broken is really not a good thing.

Also, could you try running with LC_ALL=C?

We also moved from TensorFlow r1.13 to r1.14, so it could still be consistent.

Could you isolate one sentence that reproduces the issue, and try to find out if we can track that down to one character?

That’s what I’m trying to do.

I have a list of the original ‘bad files’ that became usable after conversion, so I will see whether the issue is caused by:
a) a bad conversion
b) a bad transcript

Also, could you try running with LC_ALL=C

Sure, is this a DeepSpeech.py flag or a compilation flag?

We also moved from TensorFlow r1.13 to r1.14, so it could still be consistent.

Good to note. Will document reproducible environment.

It’s an environment variable: LC_ALL=C python DeepSpeech.py [...]


I was able to train a model on 0.6.0 with our Arabic character encodings. It looks like there was an encoding issue in the transcripts of the CSV files.

Thanks for all the help @lissyx!


Hi Anas,

Thanks for sharing your work on GitHub. I see that your .csv files separate the Arabic letters. I also see that you use the underscore character “_” to separate two words. However, the language model is trained on ordinary written Arabic text, where a space separates any two words and most letters are not separated. I know that DeepSpeech uses the language model to correct spelling and grammar mistakes in the transcription. So that means the output of your model needs to be transformed back into ordinary Arabic text before the language model can correct it, right? Please correct me if I am wrong. I am also wondering what could go wrong if you just wrote ordinary Arabic text in your “.csv” files?
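For reference, a minimal sketch of mapping a letter-separated transcript back to ordinary Arabic text, assuming the convention from the CSV example earlier in the thread (' ' separates letters, '_' separates words):

# Sketch: collapse letter-separated transcripts back into ordinary text.
def despace_transcript(transcript):
    words = transcript.strip().split('_')
    return ' '.join(''.join(word.split()) for word in words)

print(despace_transcript('و ا ت ي ن ا ه م _ ا ي ا ت ن ا'))  # -> 'واتيناهم اياتنا'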

Do you mean there’s a discrepancy between how the acoustic model learnt spaces and how the language model is built? If my understanding is correct, I would say that you are right: the usage needs to be consistent between both.


Thanks lissyx for your reply! Yes, this is exactly what I meant.

@moh.ahm.abdelraheem @lissyx

Thanks for the info! Yes, that was one of the issues we came across (and one of the reasons we were getting a warning about the CTC feature length being shorter than the transcription length).

So far we have (rough sketches of (a) and (b) follow the list):

a. Removed diacritics from both the LM and the transcriptions.
b. Checked the CTC feature length against the transcription length.
c. Removed spaces between letters.
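For anyone following along, a minimal sketch of (a) and (b). It assumes the Arabic diacritics are Unicode combining marks (category Mn) and that feature extraction uses a 32 ms window with a 20 ms step (the defaults, if I recall correctly; adjust to your own settings):

# (a) Strip Arabic diacritics: decompose, then drop combining marks (Mn).
import unicodedata
import wave

def remove_diacritics(text):
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

# (b) Rough sanity check: CTC needs at least as many feature frames as labels.
def feature_frames(wav_path, win_ms=32, step_ms=20):
    with wave.open(wav_path, 'rb') as w:
        duration_ms = 1000.0 * w.getnframes() / w.getframerate()
    return 1 + int((duration_ms - win_ms) // step_ms)

def ctc_length_ok(wav_path, transcript):
    return feature_frames(wav_path) >= len(transcript)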

Regarding (b), @lissyx, there was some code that checked for that, but apparently it was somewhat refactored and moved in 0.6.0? Do you have any context on why that is, and whether we could include that check when preprocessing the data (i.e. when checking the alphabet characters)?

Why?

No, there was never code for that. It’s at the importer level that this check is performed.

Regarding (a), we wanted to start simple and work our way towards diacritics. We’ll add them after we’ve found suitable training parameters with a decent workflow.

Regarding (b), would a contribution to add this check be welcome, or is that out of scope for the project? I’m not familiar with the DeepSpeech development roadmap.

This makes no sense. You modified the dataset, so your training parameters will not be the same.

Did you have any issues with some characters?

As I said, it’s already implemented in all the importers.