Where do I set the ignore_longer_outputs_than_inputs flag?

Hello all,

I’m struggling to find where to set the flag in the title. I know that my data may have some short files in it but I really just want to get this thing running as a test before eliminating the problem files. I tried setting it as a runtime flag (not correct I know now), as well as in the flags.txt. I searched the forums which mention setting it in line 100 something in deepspeech.py, but my deepspeech.py file is very short. Has there been an update? Where do I set this flag?

Thank you,
Matthew

I wonder, are you talking about the cutoff_top_n flag? FLAGS and most of what used to be in DeepSpeech.py have now moved into the training/deepspeech_training dir. If it is not that one, the flag you are after may still be in there.

BTW I don’t think cutoff_top_n is what you are after - as it relates more specifically to the way that the CTC_Decode step is optimised and doesn’t really relate to the length of the input file.

However, as I say, take a look at flags.py as all the flags are in there.

You want to add ignore_longer_outputs_than_inputs to the CTC loss function in training/deepspeech_training/train.py, but please understand that’s only a workaround.
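For context, the TensorFlow 1.x CTC loss (`tf.compat.v1.nn.ctc_loss`) accepts `ignore_longer_outputs_than_inputs=True` as a keyword argument; exactly how train.py calls it depends on your checkout, so check there. The sketch below only illustrates the condition that the flag silences, as I understand it: a transcript needs at least one input time step per character, plus one extra step between identical neighbouring characters. The helper names are made up for illustration and are not part of DeepSpeech.

```python
def min_ctc_steps(transcript: str) -> int:
    """Smallest number of input time steps CTC needs to emit `transcript`.

    One step per character, plus one blank step between each pair of
    identical neighbouring characters (e.g. the two l's in "hello").
    """
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats


def is_ctc_feasible(num_steps: int, transcript: str) -> bool:
    """True if a sample with `num_steps` input frames can produce `transcript`."""
    return num_steps >= min_ctc_steps(transcript)
```

Samples for which `is_ctc_feasible` is false are the ones that would be skipped silently with the flag set, which is why it is only a workaround.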

Adding to @lissyx: look at the error message, it tells you where to change something. There should be 2 places to change, one for training and one for testing, if I remember correctly.

But find out what is causing the errors, otherwise you’ll come back and ask how to improve your training, which is probably done best by finding the erroneous audio chunks :slight_smile:

Thank you all for your very helpful answers!

I was wondering, is there a way to see which audio file triggered this error? I have a database with ~20000 audio files and I’m not seeing in the error message which one of these files is erroneous. Is there a recommended workflow for checking large databases for erroneous chunks?

Thanks!

We have code that should share the offending filename, but I can’t remember if it’s working for that specific case.

Worst case, even if a bit painful, a binary search on the CSV content should be quite fast.

Ok! Where can I find the code?

Also, for the binary search, what am I searching for? I sorted from smallest audio files to biggest, but the bad audio chunks could be of any length.

Binary search in the sense of:

  • cut the CSV in half
  • if you hit the issue, your offending file is in the half that you kept
  • if you don’t hit the issue, it was in the half you removed
  • adjust and redo.
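The steps above can be automated. Here is a minimal sketch, assuming each bad file triggers the error on its own and that you can wrap "does training fail on this subset?" in a function. `fails` is a hypothetical callback (for example: write the rows to a temporary CSV and run a very short training on it); it is not something DeepSpeech provides. Because the search recurses into both halves, it keeps going when there is more than one offending file.

```python
from typing import Callable, List, Sequence


def find_bad_rows(
    rows: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return every row that makes `fails` return True, by repeated halving."""
    # A subset that does not trigger the issue contains no offending row.
    if not rows or not fails(rows):
        return []
    # A failing subset of size one is an offending row.
    if len(rows) == 1:
        return [rows[0]]
    # Otherwise cut in half and search both halves.
    mid = len(rows) // 2
    return find_bad_rows(rows[:mid], fails) + find_bad_rows(rows[mid:], fails)


# Stand-in for a real training run: pretend two clips are broken.
known_bad = {"clip2.wav", "clip5.wav"}
all_rows = [f"clip{i}.wav" for i in range(8)]
found = find_bad_rows(all_rows, lambda subset: any(r in known_bad for r in subset))
```

Each call to `fails` is a training run in practice, so this is only worthwhile when bad files are rare; with many bad files a direct check of the CSV is likely faster.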

Ah! I see what you’re saying! That makes sense, I’ll try that.

I’ve been doing the binary search and there are MANY files that are erroneous in this dataset. It’s going to take me all day! Is there any way to get the code that will show me which files are faulty? Is there any known shortcut for finding out which files are erroneous without having to listen to all 50000 files?

Thank you!

You could look at the checks in https://github.com/mozilla/DeepSpeech/blob/master/bin/import_swc.py under collect_samples(base_dir, language). I’m not sure if they are strict enough, but you could probably filter a lot of the bad ones out with those checks.

The best solution: sort your CSV file by audio file size and check the smallest entries; you might find some empty files.
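A rough sketch of that kind of check, working only from the CSV so no audio has to be opened or listened to. It assumes the usual wav_filename, wav_filesize, transcript columns, 16 kHz 16-bit mono WAV files with a 44-byte header, and one feature frame per 20 ms; adjust the constants if your data differs. The function name and the reason strings are made up for illustration.

```python
import csv
import io

WAV_HEADER_BYTES = 44   # assumption: canonical WAV header size
BYTES_PER_FRAME = 640   # assumption: 16000 samples/s * 2 bytes * 0.02 s


def suspicious_rows(csv_file):
    """Yield (wav_filename, reason) for rows that look problematic."""
    for row in csv.DictReader(csv_file):
        transcript = row["transcript"]
        audio_bytes = max(int(row["wav_filesize"]) - WAV_HEADER_BYTES, 0)
        frames = audio_bytes // BYTES_PER_FRAME  # estimated feature frames
        if audio_bytes == 0:
            yield row["wav_filename"], "empty audio"
        elif not transcript.strip():
            yield row["wav_filename"], "empty transcript"
        elif len(transcript) > frames:
            yield row["wav_filename"], "transcript longer than audio frames"


# Tiny in-memory example; with real data, pass an opened CSV file instead.
SAMPLE_CSV = """wav_filename,wav_filesize,transcript
a.wav,44,hello
b.wav,32044,hello world
c.wav,684,this transcript is far too long
"""
flagged = list(suspicious_rows(io.StringIO(SAMPLE_CSV)))
```

The frame estimate is approximate, so treat the flagged rows as candidates to inspect rather than a definitive list.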