Double free or corruption (out) Fatal Python error: Aborted

Hello guys,

I am fine-tuning DeepSpeech 0.9.2 for Indian-subcontinent English. I have collected 10 hours of male speech and 5 hours of female speech for bn_IN, plus 7 hours each of male and female hi_IN English recordings. I chose those two groups specifically because their accents are closest to my target audience.

Fine-tuning was not easy at all; I kept hitting roadblocks. The alphabet characters were a big pain point too. Although the sentences were in the English alphabet, check_characters.py falsely flagged a lot of characters as missing from the alphabet. I stripped all the flagged characters using this code and ignored the check_characters.py report.

import re

def replace_func(text):
    # Expand ampersands and lowercase first, then drop anything outside the
    # a-z / apostrophe / whitespace set that the English alphabet file allows
    # (without the lower() call, uppercase letters would be silently deleted).
    text = text.replace('&', "and").lower()
    return re.sub(r"[^a-z'\s]", "", text)
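For anyone wanting to reproduce this, the function can be applied over the transcript column of the standard DeepSpeech-format CSVs (wav_filename, wav_filesize, transcript); a minimal sketch with placeholder paths:

import pandas as pd

# Clean every transcript in a training CSV in place,
# reusing replace_func from above. The path is illustrative.
df = pd.read_csv('/kaggle/tmp/train.csv')
df['transcript'] = df['transcript'].apply(replace_func)
df.to_csv('/kaggle/tmp/train.csv', index=False)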

Due to the long training time I have cut down the data and am now using only the bn_IN set (15 hours). I am now facing a completely different type of issue. After the first epoch, the training crashes with double free or corruption (out). I could not find any proper fix for this. Can any knowledgeable person please help me out here?

Epoch 1 |   Training | Elapsed Time: 0:30:14 | Steps: 5381 | Loss: 174.223625  WARNING: sample rate of sample "/kaggle/tmp/bengali_female_english/wav/f0088.wav" ( 48000 ) does not match FLAGS.audio_sample_rate. This can lead to incorrect results.
double free or corruption (out)
Fatal Python error: Aborted

Thread 0x00007fe5137fe700 (most recent call first):
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 379 in _recv
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 250 in recv
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 470 in _handle_results
  File "/opt/conda/lib/python3.7/threading.py", line 870 in run
  File "/opt/conda/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/opt/conda/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fe512ffd700 (most recent call first):
  File "/kaggle/tmp/DeepSpeech/training/deepspeech_training/util/helpers.py", line 97 in _limit
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 292 in _guarded_task_generation
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 426 in _handle_tasks
  File "/opt/conda/lib/python3.7/threading.py", line 870 in run
  File "/opt/conda/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/opt/conda/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fe513fff700 (most recent call first):
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 413 in _handle_workers
  File "/opt/conda/lib/python3.7/threading.py", line 870 in run
  File "/opt/conda/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/opt/conda/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fe5977fe700 (most recent call first):
  File "/opt/conda/lib/python3.7/threading.py", line 296 in wait
  File "/opt/conda/lib/python3.7/queue.py", line 170 in get
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/opt/conda/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/opt/conda/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fe597fff700 (most recent call first):
  File "/opt/conda/lib/python3.7/threading.py", line 296 in wait
  File "/opt/conda/lib/python3.7/queue.py", line 170 in get
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/opt/conda/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/opt/conda/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fe59cd1d700 (most recent call first):
  File "/opt/conda/lib/python3.7/threading.py", line 296 in wait
  File "/opt/conda/lib/python3.7/queue.py", line 170 in get
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/opt/conda/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/opt/conda/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fe6e28d2740 (most recent call first):
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956 in run
  File "/kaggle/tmp/DeepSpeech/training/deepspeech_training/train.py", line 570 in run_set
  File "/kaggle/tmp/DeepSpeech/training/deepspeech_training/train.py", line 605 in train
  File "/kaggle/tmp/DeepSpeech/training/deepspeech_training/train.py", line 948 in main
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 251 in _run_main
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 303 in run
  File "/kaggle/tmp/DeepSpeech/training/deepspeech_training/train.py", line 976 in run_script
  File "./DeepSpeech.py", line 12 in <module>
Aborted (core dumped)
!python3 util/taskcluster.py --source tensorflow --artifact convert_graphdef_memmapped_format --branch r1.15 --target .
!./convert_graphdef_memmapped_format --in_graph=/kaggle/working/models/ft_model.pb --out_graph=/kaggle/working/models/ft_model.pbmm
Downloading https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.r1.15.cpu/artifacts/public/convert_graphdef_memmapped_format ...
Downloading: 100%

2020-12-05 01:30:20.906529: E tensorflow/contrib/util/convert_graphdef_memmapped_format.cc:79] Conversion failed Failed to load graph at '/kaggle/working/models/ft_model.pb' : /kaggle/working/models/ft_model.pb; No such file or directory

These were the training parameters (the conversion above failed simply because the crash aborted training before ft_model.pb was ever exported). I have not specified a dropout or learning rate because I was not sure what good values would be here.

!python3 ./DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 3 --n_hidden 2048 --epochs 5 --export_dir /kaggle/working/models/ --checkpoint_dir /kaggle/tmp/model_checkpoints/ --train_files /kaggle/tmp/train.csv --dev_files /kaggle/tmp/dev.csv --test_files /kaggle/tmp/test.csv --export_file_name 'ft_model' --augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0]  --augment volume[p=0.2,dbfs=-10:-40] --augment pitch[p=0.2,pitch=1~0.2]  --augment tempo[p=0.2,factor=1~0.5]

I have also shared the full code I used to train the model. If anyone has any suggestions, please let me know. I am new to this, and any help is really appreciated.

NB: I used Kaggle for training.

please don’t use conda?

I was using Kaggle. Looks like there is no way to ditch conda there. I’ll move the project to a local machine and try again. Let’s hope for the best :slight_smile:

Don’t use the standard learning rate for fine tuning and maybe use dropout.

And maybe no augmentation, but that is debatable. And you have a warning for one of your wavs. You have very little material, so it has to be perfect.

The error might also be some timeout at Kaggle.


Thank you, Olaf, for your suggestion. I applied it and the training ran successfully. These were my parameters:

!python3 /content/DeepSpeech/DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 3 --n_hidden 2048 --epochs 50 --train_batch_size 16 --dev_batch_size 8 --test_batch_size 8 --learning_rate 0.0001 --dropout_rate 0.5 --audio_sample_rate 48000 --export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ --train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv --export_file_name 'ft_model'

I had another weird issue with CTC_Loss though.

  (0) Invalid argument: Not enough time for target transition sequence (required: 182, available: 177)11You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
	 [[{{node tower_0/CTCLoss}}]]

I passed the ignore_longer_outputs_than_inputs=True flag to the ctc_loss() call as a workaround.
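For reference, this is roughly what that change looks like in training/deepspeech_training/train.py; the surrounding call is paraphrased from the 0.9.x source, so names and line numbers may differ slightly:

# In calculate_mean_edit_distance_and_loss(), paraphrased from 0.9.x:
total_loss = tfv1.nn.ctc_loss(labels=batch_y,
                              inputs=logits,
                              sequence_length=batch_seq_len,
                              ignore_longer_outputs_than_inputs=True)  # added flag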

I set 50 epochs, but training early-stopped at the 15th epoch. This was the result. Note that, by mistake, I did NOT start from the DeepSpeech 0.9.2 checkpoint here.

Test on /content/test.csv - WER: 0.258427, CER: 0.076948, loss: 19.203972

This is not as good as I was hoping for, unfortunately :frowning: Do you have any tweak suggestions for me?

Data used:
Native: bn_IN => Speaking: en_IN
Male: 10 hours
Female: 5 hours

Native: hi_IN => Speaking: en_IN
Male: 7 hours
Female: 7 hours

Total audio: 34 hours.

I am currently training another model from the 0.9.2 checkpoint to see if it makes any difference.
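For anyone following along, loading the released checkpoint just means extracting the tarball and pointing --load_checkpoint_dir at it; roughly like this (release URL and paths to the best of my understanding, adjust as needed):

!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.2/deepspeech-0.9.2-checkpoint.tar.gz
!tar xzf deepspeech-0.9.2-checkpoint.tar.gz
!python3 /content/DeepSpeech/DeepSpeech.py --load_checkpoint_dir /content/deepspeech-0.9.2-checkpoint --save_checkpoint_dir /content/model_checkpoints/ # plus the training flags from the command above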

audio_sample_rate 48000

Use 16 kHz like all of us :slight_smile:
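If the source wavs are 48 kHz, resampling them up front is straightforward; a sketch assuming librosa and soundfile are available (paths illustrative):

import glob, os
import librosa
import soundfile as sf

os.makedirs('wav16', exist_ok=True)
for path in glob.glob('wav/*.wav'):
    # librosa resamples to the requested rate on load
    audio, _ = librosa.load(path, sr=16000, mono=True)
    sf.write(os.path.join('wav16', os.path.basename(path)), audio, 16000)

Remember that the wav_filesize column in the training CSVs has to be regenerated afterwards.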

This means you have bad data; get rid of it.

With this little material your results are actually quite good. Use a lower learning rate for the transfer. And maybe try the 0.8 checkpoint; some had problems with the newer release.
