I need some clarification on the ignore_longer_outputs_than_inputs flag

@kdavis @reuben I was training the DeepSpeech 0.5.0 model on data I scraped from YouTube, using its closed captions (VTT subtitles) as transcripts, when I got this error:

Not enough time for target transition sequence (required: 102, available: 0)0You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs

I passed ignore_longer_outputs_than_inputs=True to tf.nn.ctc_loss and the model started training again, but I need some clarification on this.

What does it mean?

Why do I get this error? It may be true that my transcripts do not match the audio 100%, but I remember giving this model completely wrong transcripts and it still trained on them.
And how can I find out how many training samples it is ignoring once this flag is set? What if it is skipping all of the samples? I am not seeing even the slightest effect on the model after training all day.

So far there’s no better solution than filtering on min/max length and/or doing a binary search to find the offending samples.

How do I filter on min/max length? Sorry, I did not fully understand that. :roll_eyes::grimacing:
How do I find the offending samples? The error does not say which sample it is stuck on.

You can look at the data directly. If the audio is too short for its transcript, it won’t work. Audio windows have a 20ms step between them, so to get the number of windows from an audio file you can just divide its duration by 20ms, and then compare that with the length of the transcript.
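
As a rough illustration of that arithmetic, here is a minimal Python sketch. The function names are hypothetical and not part of DeepSpeech, and the count is only the approximation described above (one window per 20 ms step), so the exact number of windows the model sees may differ slightly.

```python
def num_windows(duration_ms, step_ms=20):
    """Approximate number of audio windows for a clip, one window per step."""
    return duration_ms // step_ms


def transcript_fits(duration_ms, transcript, step_ms=20):
    """True if the clip has at least one window per transcript character."""
    return len(transcript) <= num_windows(duration_ms, step_ms)


# A 340 ms clip gives 17 windows, too few for a 20-character transcript.
print(num_windows(340))                # 17
print(transcript_fits(340, "a" * 20))  # False
```

Samples for which transcript_fits returns False are the ones to drop or inspect by hand.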

Good answer. However, as far as I know, the CTC loss calculation inserts a blank character ‘-’ between repeated characters of the transcript, or something like that. That would make the comparison with the transcript length an indicator rather than an exact check. @reuben, what do you think?

I don’t think CTC blanks are relevant here.

@reuben, @lissyx: I am using DeepSpeech v0.5.0 and I am also encountering this error. I have set ignore_longer_outputs_than_inputs=True:

total_loss = tf.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len, ignore_longer_outputs_than_inputs=True)

Now, when I run the training, my training loss is always infinity. Could you please advise how to resolve this?

Epoch 0 | Training | Elapsed Time: 0:12:42 | Steps: 1142 | Loss: inf
Epoch 0 | Validation | Elapsed Time: 0:01:39 | Steps: 163 | Loss: 146.396210 | Dataset: …/german-speech-corpus/data_mailabs/dev.csv
I Saved new best validating model with loss 146.396210 to: /home/agarwal/.local/share/deepspeech/checkpoints/best_dev-1142
Epoch 1 | Training | Elapsed Time: 0:12:32 | Steps: 1142 | Loss: inf
Epoch 1 | Validation | Elapsed Time: 0:00:58 | Steps: 163 | Loss: 131.277453 | Dataset: …/german-speech-corpus/data_mailabs/dev.csv
WARNING:tensorflow:From /home/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/python/training/saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
I Saved new best validating model with loss 131.277453 to: /home/agarwal/.local/share/deepspeech/checkpoints/best_dev-2284
Epoch 2 | Training | Elapsed Time: 0:12:33 | Steps: 1142 | Loss: inf
Epoch 2 | Validation | Elapsed Time: 0:00:58 | Steps: 163 | Loss: 125.264005 | Dataset: …/german-speech-corpus/data_mailabs/dev.csv
I Saved new best validating model with loss 125.264005 to: /home/agarwal/.local/share/deepspeech/checkpoints/best_dev-3426
Epoch 3 | Training | Elapsed Time: 0:12:34 | Steps: 1142 | Loss: inf
Epoch 3 | Validation | Elapsed Time: 0:00:58 | Steps: 163 | Loss: 128.504051 | Dataset: …/german-speech-corpus/data_mailabs/dev.csv
Epoch 4 | Training | Elapsed Time: 0:08:50 | Steps: 918 | Loss: inf
(env) agarwal@wika:~/DeepSpeech$

@lissyx, could you please help with the above issue? Even after setting the flag, it didn’t work.

The training loss is inf while the validation loss is decreasing. I am using the German-Mailabs dataset.

Hello!

I started training the system with another dataset and the training worked well (I let it train for about 100 epochs to see if everything works fine), but when it started testing I also got the wonderful error:

"tensorflow.python.framework.errors_impl.InvalidArgumentError: Not enough time for target transition sequence (required: 20, available: 17)1You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
[[{{node CTCLoss}}]] "
I should mention that the flag ignore_longer_outputs_than_inputs=True was already added to the ctc_loss call in DeepSpeech.py, and I still get the error!

Any ideas? :frowning:

I guess this could well be a new thread, but if I remember correctly, this is a different line in the DeepSpeech.py script?

Anyway, this means you have an input that probably doesn’t match the transcript. Are you sure about your data? Especially for testing this could be a problem.

In DeepSpeech.py I added the flag, because without it the training wouldn’t start:

# Compute the CTC loss using TensorFlow's ctc_loss
total_loss = tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len, ignore_longer_outputs_than_inputs=True)

As I mentioned before, the error occurs again when testing starts, and it again asks me to set the flag that is already there.

Unfortunately I can’t say that I am sure of my data, because it was provided by another university that used it for other purposes (not speech recognition). For testing there are around 27000 audio files. I randomly checked some of them by listening to whether they match the transcript and didn’t notice any problem, so there is probably a mistake somewhere in the dataset.

Please search on the forum, this is already extensively documented as a data-level issue.

Write a small script that checks transcription length vs. audio length in the csv. Then check outliers manually.
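
Such a script could look roughly like the sketch below. It assumes a CSV with wav_filename and transcript columns, PCM WAV files readable by Python's wave module, paths in wav_filename that resolve from the current directory, and the 20 ms step mentioned earlier in the thread. The function names are made up for illustration.

```python
import csv
import wave


def wav_duration_ms(path):
    """Return the duration of a PCM WAV file in whole milliseconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() * 1000 // wav.getframerate()


def find_offenders(csv_path, step_ms=20):
    """Yield (wav_filename, n_windows, n_chars) for rows whose transcript
    has more characters than the audio has windows."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Approximation: one window per step_ms of audio.
            n_windows = wav_duration_ms(row["wav_filename"]) // step_ms
            n_chars = len(row["transcript"])
            if n_chars > n_windows:
                yield row["wav_filename"], n_windows, n_chars
```

Looping over find_offenders("train.csv") (the file name is only an example) and printing each tuple gives a list of candidates to check manually.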

A long time ago this solved my issue: https://github.com/mozilla/DeepSpeech/issues/1629#issuecomment-436864418

What works for me to assess the quality of a set is to use ignore_longer_outputs_than_inputs in evaluate together with the test_output_file flag to generate a JSON file sorted by loss. Then I use my own .NET app to listen to the worst examples from the generated JSON.
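
If you don't have a dedicated app, a few lines of Python can pull out the worst examples. This is only a sketch: it assumes the file produced by test_output_file is a JSON list of objects that each carry a numeric "loss" field, which you should verify against your own output. The function name is hypothetical.

```python
import json


def worst_samples(json_path, n=20):
    """Return the n samples with the highest loss from a test output file."""
    with open(json_path, encoding="utf-8") as f:
        samples = json.load(f)
    # Assumed layout: a list of objects, each with a numeric "loss" field.
    return sorted(samples, key=lambda s: s["loss"], reverse=True)[:n]
```

Sorting again here means the result does not depend on the order in which the file was written. The returned objects can then be printed or used to play back the corresponding audio files.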

Never mind, audiofile_to_input_vector is no longer a thing.
