EOFError when training multiple files

Hi there,

I also posted in the "Testing for correctness of the samples" topic, but I think my issue warrants a separate topic so I can provide the full context of what I am trying to achieve.

I am trying to create an ASR system for Urdu (a language native to Pakistan) using the Ubuntu 16.04 Deep Learning AMI from AWS. I have installed DeepSpeech (0.9.2) in a virtualenv as described in the documentation. I have been testing with one data source using various configurations, and the train/test/dev files for that source have worked fine. I have now assembled various data sources with decent transcriptions to expand the data set.

I have generated a separate scorer file for Urdu, as pointed out in multiple topics on Discourse. The command I am using to run training is:
python3 DeepSpeech.py --drop_source_layers 1 --alphabet_config_path /$HOME/Uploads/UrduAlphabet_newscrawl2.txt --checkpoint_dir /$HOME/DeepSpeech/dataset/trained_load_checkpoint --train_files /$HOME/Uploads/trainbusiness.csv --dev_files /$HOME/Uploads/devbusiness.csv --test_files /$HOME/Uploads/testbusiness.csv --epochs 2 --train_batch_size 32 --export_dir /$HOME/DeepSpeech/dataset/urdu_trained --export_file_name urdu --test_batch_size 12 --learning_rate 0.00001 --reduce_lr_on_plateau true --scorer /$HOME/Uploads/kenlm.scorer

Here comes the interesting part: this works perfectly when I run it separately for each of the different data sources. I have generated train, test, and dev files for each data source, and there were no issues when I used those files individually. I only get the exception below when I combine the CSV files and run the whole data set together. Obviously, I want to do that so there is more data and I can train on the whole set for a higher number of epochs.

I Loading best validating checkpoint from //home/ubuntu/DeepSpeech/dataset/trained_load_checkpoint/best_dev-150
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I Initializing variable: layer_6/bias
I Initializing variable: layer_6/bias/Adam
I Initializing variable: layer_6/bias/Adam_1
I Initializing variable: layer_6/weights
I Initializing variable: layer_6/weights/Adam
I Initializing variable: layer_6/weights/Adam_1
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:04 | Steps: 1 | Loss: 15.989467
Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 976, in run_script
    absl.app.run(main)
  File "/home/ubuntu/tmp/deepspeech-venv/lib/python3.7/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/ubuntu/tmp/deepspeech-venv/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 948, in main
    train()
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 605, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/train.py", line 571, in run_set
    exception_box.raise_if_set()
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/util/helpers.py", line 123, in raise_if_set
    raise exception  # pylint: disable = raising-bad-type
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/util/helpers.py", line 131, in do_iterate
    yield from iterable()
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/util/feeding.py", line 114, in generate_values
    for sample_index, sample in enumerate(samples):
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/util/augmentations.py", line 221, in apply_sample_augmentations
    yield from pool.imap(_augment_sample, timed_samples())
  File "/home/ubuntu/DeepSpeech/training/deepspeech_training/util/helpers.py", line 102, in imap
    for obj in self.pool.imap(fun, self._limit(it)):
  File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
EOFError

This is the exception I get when I have already run the same command on two data sets and then try to improve the training by adding one more data set. All .wav files have been converted to mono and 16 kHz.

I have also used the csv_combiner from https://github.com/dabinat/deepspeech-tools, thinking that my own code wasn't combining them correctly.
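For reference, the combining itself should just be a concatenation of the per-source CSVs. A minimal pandas sketch of what I mean (placeholder file names, assuming the standard wav_filename, wav_filesize, transcript columns):

```python
import pandas as pd

# Placeholder paths - substitute the actual per-source train CSVs.
sources = ["train_source1.csv", "train_source2.csv", "train_source3.csv"]

# Concatenate and keep only the three columns DeepSpeech expects.
frames = [pd.read_csv(path) for path in sources]
combined = pd.concat(frames, ignore_index=True)
combined = combined[["wav_filename", "wav_filesize", "transcript"]]

# Write without the pandas index so the header stays wav_filename,wav_filesize,transcript.
combined.to_csv("train_combined.csv", index=False)
```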

I have shared both sources here, the combined CSV files as well as the separate CSV files.
combinedrun.zip (100.4 KB) separaterun.zip (86.1 KB)

Can someone please point me in the right direction?

Thank you!

This is strange and probably has to do with how you combine the data. Here are some ideas:

  1. Never ever use a space in a filename. It doesn't end well - ever. (business wavs) There is a quick check for this in the sketch after this list.

  2. I see that the single train CSV and the combined train CSV have different newline characters. I don't think this is the problem, but it is a difference.

  3. Use some dropout for training. Search the forum for suggested values, maybe 0.15-0.3.
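A rough sketch of how you could check points 1 and 2 yourself (the path is a placeholder; I am assuming the standard DeepSpeech CSV columns):

```python
import csv
import os

# Placeholder path - point this at the combined train CSV.
CSV_PATH = "train_combined.csv"

# Point 2: count the different newline styles in the raw file.
with open(CSV_PATH, "rb") as f:
    raw = f.read()
crlf = raw.count(b"\r\n")
print("CRLF lines:", crlf, "| LF-only lines:", raw.count(b"\n") - crlf)

# Point 1: look for spaces in the referenced wav paths and for missing files.
with open(CSV_PATH, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        wav = row["wav_filename"]
        if " " in wav:
            print("space in filename:", wav)
        if not os.path.isfile(wav):
            print("missing file:", wav)
```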

I guess the error comes right away.

Oh, and after looking at it again: the file size changes from the single run to the combined run!!! That looks like a good reason for an EOF error…

@othiele The files I attached before were produced by the other library; I have now attached the ones that I combined myself. They have almost the same size.
combinedrun.zip (82.7 KB)

You did not attach the real files? I don’t get it. As stated above, you need to organize your files correctly. If the single files work, the combined files work as well.

Sorry, I don’t have the time to review more revisions of the “actual” files you used. You will have to do some detective work of your own.

I did attach the real files. I combined them both ways: by writing some code myself and by using a DeepSpeech util library. The latest ones will have the same newline characters as the single files, since they were produced by my own code.

I have only reached out as my detective work did not yield any results. I have spent days trying to compare these files and the EOFError doesn’t really tell me where to look.

No, you did not attach the corresponding files, as one had a different file size from the other. This shows that you are not searching systematically for the error, and it is impossible for us to find it with that sort of information.

The problem is very likely the file size or something else about the files themselves. You need to write import scripts to check them. If the error occurs at step 1, it is within the first files. Use the reverse flag option to start from the end instead; if you get the error again at step 1, then probably all of your files have a problem. You can use the limit option to check just parts of the whole set. (I think the flags are --reverse_train and --limit_train in 0.9, but check the flags help.)
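A checking script does not have to be fancy; a rough sketch of the kind of thing I mean (placeholder path, assuming the standard CSV columns and 16 kHz mono 16-bit wavs):

```python
import csv
import os
import wave

# Placeholder path - run this over each of your CSVs in turn.
CSV_PATH = "train_combined.csv"

with open(CSV_PATH, newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f), start=1):
        wav = row["wav_filename"]
        if not os.path.isfile(wav):
            print(i, "missing:", wav)
            continue
        # The CSV's wav_filesize should match the real size on disk.
        if int(row["wav_filesize"]) != os.path.getsize(wav):
            print(i, "size mismatch:", wav)
        # DeepSpeech expects 16 kHz, mono, 16-bit PCM audio.
        with wave.open(wav, "rb") as w:
            if (w.getframerate(), w.getnchannels(), w.getsampwidth()) != (16000, 1, 2):
                print(i, "format:", wav, w.getframerate(), w.getnchannels(), w.getsampwidth())
```

Rows printed by something like this are the first place I would look.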