Question about data preprocessing due to error during training

Hi everyone!

I am currently using DeepSpeech code v0.8.0, running training inside a Docker container created from the template provided by the Makefile available in the DeepSpeech repo.

I am now facing this error when trying to train on the Common Voice Spanish corpus:

root@32b0785706d5:/DeepSpeech# ./bin/
+ [ ! -f ]
+ python -u --train_files /data/cv_es/train.csv --test_files /data/cv_es/test.csv --dev_files /data/cv_es/dev.csv --train_batch_size 100 --dev_batch_size 100 --test_batch_size 1 --n_hidden 100 --epochs 1 --checkpoint_dir /checkpoints
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/", line 1443, in _call_tf_sessionrun
tensorflow.python.framework.errors_impl.InvalidArgumentError: Not enough time for target transition sequence (required: 20, available: 19)
You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
         [[{{node tower_0/CTCLoss}}]]

These are my training parameters:

python -u \
  --train_files /data/cv_es/train.csv \
  --test_files /data/cv_es/test.csv \
  --dev_files /data/cv_es/dev.csv \
  --train_batch_size 100 \
  --dev_batch_size 100 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 1 \
  --checkpoint_dir /checkpoints \

These are just test values; I will adjust them later for proper training.

From what I have been reading in other people’s posts, audio clips that are too short can trigger this error.

I have read that setting the flag ignore_longer_outputs_than_inputs to True on the CTC loss function can suppress this. I have also read that this is only a workaround and that the data should be cleaned more thoroughly.

What I don’t know is what kind of cleaning I must perform. Maybe removing audios shorter than some threshold duration? After setting that flag (ignore_longer_outputs_than_inputs), I listened to some of the audios that trigger the warning and they sound normal to me. They are roughly 2 seconds long or less, and I am afraid this could be the problem.
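To make the check concrete, here is my rough understanding of the length constraint behind the error. The 10 ms feature window and the stride of 2 are my assumptions about how DeepSpeech derives the number of CTC steps; I have not confirmed them against the source:

```python
SAMPLE_RATE = 16000  # Hz; Common Voice clips converted for DeepSpeech are 16 kHz mono

def available_ctc_steps(frames: int) -> int:
    """One feature window every 10 ms, halved by an assumed model stride of 2."""
    duration_ms = frames / SAMPLE_RATE * 1000
    return int(duration_ms / 10 / 2)

def fits_transcript(frames: int, transcript: str) -> bool:
    """CTC needs at least one output step per target character."""
    return available_ctc_steps(frames) >= len(transcript)

# A 2-second clip yields 100 steps, so a 10-char transcript fits:
print(fits_transcript(2 * SAMPLE_RATE, "hola mundo"))  # True
# A 0.38-second clip yields only 19 steps, like the failing sample:
print(fits_transcript(6080, "x" * 20))                 # False
```

Under these assumptions, a 2-second clip should comfortably fit a normal-length transcript, which is why I am surprised the clips I listened to fail.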

Could anyone suggest a solution or give some hint on the problem?

Thanks in advance :slight_smile:

The error is unrelated to Docker; it means that the length of the audio and the length of the transcript do not match. Think of it as a plausibility check. If I remember correctly, you need at least 20 ms per character, but search the forum.

Write a script that selects the 50 or so worst matches and check them manually. This will give you some idea of what is wrong in your data.
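Something like this sketch, assuming the standard DeepSpeech CSV columns (wav_filename, wav_filesize, transcript) and 16-bit mono PCM audio so that duration can be estimated from the file size:

```python
import csv

SAMPLE_RATE = 16000
BYTES_PER_SECOND = SAMPLE_RATE * 2  # assuming 16-bit mono PCM data

def worst_matches(csv_path, n=50):
    """Return the n samples with the least audio time per transcript character."""
    scored = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            transcript = row["transcript"]
            if not transcript:
                continue
            seconds = int(row["wav_filesize"]) / BYTES_PER_SECOND
            ms_per_char = seconds * 1000 / len(transcript)
            scored.append((ms_per_char, row["wav_filename"], transcript))
    scored.sort()  # lowest ms per char first = most suspicious
    return scored[:n]

# Usage, e.g.:
#   for ms_per_char, path, text in worst_matches("/data/cv_es/train.csv"):
#       print(f"{ms_per_char:6.1f} ms/char  {path}  {text[:60]}")
```

Listening to the files this surfaces usually tells you quickly whether the audio is truncated, the transcript is wrong, or both.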

Yes, I have read about that in other posts. The thing is, I think that check is already done in the preprocessing script I am using.

That script is a modified version of the one taken from the DeepSpeech repo. I adapted it to my needs by changing a couple of output paths, but the rest of the code remains untouched.

Here is the part of the code that I think performs that check. The “too_short” elif branch divides the frames of each audio by a ratio involving the sample rate.

label = FILTER_OBJ.filter(sample[1])
rows = []
counter = get_counter()
if file_size == -1:
    # Excluding samples that failed upon conversion
    counter["failed"] += 1
elif label is None:
    # Excluding samples that failed on label validation
    counter["invalid_label"] += 1
elif int(frames / SAMPLE_RATE * 1000 / 10 / 2) < len(str(label)):
    # Excluding samples that are too short to fit the transcript
    counter["too_short"] += 1
elif frames / SAMPLE_RATE > MAX_SECS:
    # Excluding very long samples to keep a reasonable batch-size
    counter["too_long"] += 1
else:
    # This one is good - keep it for the target CSV
    wav_filename_split = wav_filename.split(os.path.sep)
    audio_filename = os.path.join(os.path.pardir, "/".join(wav_filename_split[wav_filename_split.index("data") + 1:]))
    rows.append((audio_filename, file_size, label, sample[2]))
    counter["imported_time"] += frames
counter["all"] += 1
counter["total_time"] += frames

It works for most audios; the script reports that almost 2000 audios were omitted for being too short.

I don’t want to sound like I am ignoring your proposal; I am just curious whether this is what you meant. If not, then I don’t understand what that specific branch is checking.

CV data is often problematic and errors are to be expected :slight_smile:

Yes, this is what I meant. But since your error is 20 vs. 19, maybe the script is off by one somewhere, e.g. starting an array at 1? I haven’t checked it. I ended up writing my own check script and extended it over time.

Even if you skip 2000 files, you should only include good material. Just a few bad audios will significantly hurt your training.


I have tried to train using the whole dataset I have (almost 400 hours of audio). Out of 500,000+ audio files, only 84 are faulty. I think it is wiser and more time-efficient to remove them by hand via a blacklist in the preprocessing script than to filter them out programmatically. The script already does a pretty good job filtering out the other junk.
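A minimal sketch of how I plan to wire in the blacklist (the file name blacklist.txt and the helper names are mine, not from the repo):

```python
import os

def load_blacklist(path: str) -> set:
    """One wav filename per line; blank lines are ignored."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def is_blacklisted(wav_filename: str, blacklist: set) -> bool:
    """Match on the basename so absolute and relative paths both work."""
    return os.path.basename(wav_filename) in blacklist
```

In the per-sample loop of the import script, an extra branch like `elif is_blacklisted(wav_filename, blacklist): counter["blacklisted"] += 1` before the other checks would skip those 84 files.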

I don’t like this kind of workaround, but I think it will suffice.

Thanks for your help! Much appreciated :smiley:

For my Spanish training I excluded files as follows:

  • shorter than 1 s or longer than 45 s
  • based on average time per char in the dataset (much faster/slower than average, with a special exception for short files)
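Roughly, the filter looks like this simplified sketch; the tolerance factor here is illustrative, and the special exception for short files is not reproduced:

```python
def keep_sample(seconds: float, transcript: str, avg_ms_per_char: float,
                tolerance: float = 3.0, min_secs: float = 1.0,
                max_secs: float = 45.0) -> bool:
    """Drop clips under 1 s or over 45 s, and clips whose time per character
    deviates from the dataset average by more than `tolerance` times
    (the tolerance value is a guess, not the exact one from my script)."""
    if not (min_secs <= seconds <= max_secs):
        return False
    ms_per_char = seconds * 1000 / max(len(transcript), 1)
    return avg_ms_per_char / tolerance <= ms_per_char <= avg_ms_per_char * tolerance
```

You first compute avg_ms_per_char over the whole dataset, then make a second pass applying the filter.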

You can find my training setup and results here:

The preprocessing script is here: preprocessing/
It’s not optimal, but I could run the trainings without any errors. Currently I’m trying to improve the setup a bit, because the model has some problems understanding very short words.


Nice!! I will check it out.

Thanks a lot :smiley: