Hi everyone!
I am currently using DeepSpeech code v0.8.0, running training inside a Docker container created from the template provided by the Makefile available in the DeepSpeech repo.
I am now facing this error when trying to train on the Common Voice Spanish data corpus:
root@32b0785706d5:/DeepSpeech# ./bin/run-ES-ds.sh
+ [ ! -f DeepSpeech.py ]
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --train_files /data/cv_es/train.csv --test_files /data/cv_es/test.csv --dev_files /data/cv_es/dev.csv --train_batch_size 100 --dev_batch_size 100 --test_batch_size 1 --n_hidden 100 --epochs 1 --checkpoint_dir /checkpoints
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Not enough time for target transition sequence (required: 20, available: 19). You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
[[{{node tower_0/CTCLoss}}]]
These are my training parameters:
python -u DeepSpeech.py \
--train_files /data/cv_es/train.csv \
--test_files /data/cv_es/test.csv \
--dev_files /data/cv_es/dev.csv \
--train_batch_size 100 \
--dev_batch_size 100 \
--test_batch_size 1 \
--n_hidden 100 \
--epochs 1 \
--checkpoint_dir /checkpoints \
"$@"
These are just test values; I will adjust them later for proper training.
From what I have been reading in other people's posts, audios that are too short can yield this error.
I have read that setting the flag ignore_longer_outputs_than_inputs to True on the CTC loss function can suppress it, but also that this is only a workaround and that the data should really be cleaned more thoroughly.
What I don't know is what kind of cleaning I should perform. Maybe removing audios shorter than some duration threshold? After setting that flag, I listened to some of the audios that trigger the error and they sound normal to me. They are roughly 2 seconds long or less, and I am afraid this could be the problem.
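In case it helps anyone reading this, my understanding of the error is that CTC needs at least one feature frame per target character, plus an extra frame (for the mandatory blank) between every pair of repeated characters, so a clip fails when its transcript is longer than the number of frames the audio produces. Below is a minimal filtering sketch I put together under those assumptions; it assumes the standard DeepSpeech CSV columns (wav_filename, wav_filesize, transcript) and the default 20 ms feature window step (--feature_win_step), so the exact frame count is only an approximation of what the model computes internally:

```python
import csv
import wave

# DeepSpeech default --feature_win_step, in milliseconds (assumption).
FEATURE_WIN_STEP_MS = 20

def min_frames_needed(transcript):
    """Minimum CTC time steps: one per label, plus one extra frame
    (for the blank) between each pair of repeated characters."""
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats

def audio_frames(wav_path):
    """Approximate number of feature frames the model will see."""
    with wave.open(wav_path) as w:
        duration_ms = 1000.0 * w.getnframes() / w.getframerate()
    return int(duration_ms // FEATURE_WIN_STEP_MS)

def filter_csv(in_path, out_path):
    """Copy in_path to out_path, dropping rows whose transcript is
    too long for the clip's duration."""
    with open(in_path, newline='') as fin, open(out_path, 'w', newline='') as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if audio_frames(row['wav_filename']) >= min_frames_needed(row['transcript']):
                writer.writerow(row)
```

If this reasoning is right, a 2-second clip gives roughly 100 frames, which should comfortably fit a short transcript, so clips failing with "available: 19" would be well under half a second of actual audio; maybe worth checking whether some files are truncated rather than just short.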
Could anyone suggest a solution or give some hint on the problem?
Thanks in advance