Where do I set the ignore_longer_outputs_than_inputs flag?

Matthew_Tan · July 23, 2020, 12:03am

Hello all,

I’m struggling to find where to set the flag in the title. I know that my data may have some short files in it but I really just want to get this thing running as a test before eliminating the problem files. I tried setting it as a runtime flag (not correct I know now), as well as in the flags.txt. I searched the forums which mention setting it in line 100 something in deepspeech.py, but my deepspeech.py file is very short. Has there been an update? Where do I set this flag?

Thank you,
Matthew

utunga · July 23, 2020, 5:22am

I wonder, are you talking about the cutoff_top_n flag? FLAGS and most of what used to be in DeepSpeech.py have now moved into the training/deepspeech_training directory. If it is not that one, maybe the flag you are after is in there.

https://github.com/mozilla/DeepSpeech/blob/master/training/deepspeech_training/util/flags.py#L160


# Decoder

f.DEFINE_boolean('utf8', False, 'enable UTF-8 mode. When this is used the model outputs UTF-8 sequences directly rather than using an alphabet mapping.')
f.DEFINE_string('alphabet_config_path', 'data/alphabet.txt', 'path to the configuration file specifying the alphabet used by the network. See the comment in data/alphabet.txt for a description of the format.')
f.DEFINE_string('scorer_path', 'data/lm/kenlm.scorer', 'path to the external scorer file.')
f.DEFINE_alias('scorer', 'scorer_path')
f.DEFINE_integer('beam_width', 1024, 'beam width used in the CTC decoder when building candidate transcriptions')
f.DEFINE_float('lm_alpha', 0.931289039105002, 'the alpha hyperparameter of the CTC decoder. Language Model weight.')
f.DEFINE_float('lm_beta', 1.1834137581510284, 'the beta hyperparameter of the CTC decoder. Word insertion weight.')
f.DEFINE_float('cutoff_prob', 1.0, 'only consider characters until this probability mass is reached. 1.0 = disabled.')
f.DEFINE_integer('cutoff_top_n', 300, 'only process this number of characters sorted by probability mass for each time step. If bigger than alphabet size, disabled.')

# Inference mode

f.DEFINE_string('one_shot_infer', '', 'one-shot inference mode: specify a wav file and the script will load the checkpoint and perform inference on it.')

# Optimizer mode

f.DEFINE_float('lm_alpha_max', 5, 'the maximum of the alpha hyperparameter of the CTC decoder explored during hyperparameter optimization. Language Model weight.')
f.DEFINE_float('lm_beta_max', 5, 'the maximum beta hyperparameter of the CTC decoder explored during hyperparameter optimization. Word insertion weight.')
f.DEFINE_integer('n_trials', 2400, 'the number of trials to run during hyperparameter optimization.')

utunga · July 23, 2020, 5:30am

BTW I don’t think cutoff_top_n is what you are after - as it relates more specifically to the way that the CTC_Decode step is optimised and doesn’t really relate to the length of the input file.

However, as I say, take a look at flags.py as all the flags are in there.

lissyx · July 23, 2020, 6:38am

You want to add ignore_longer_outputs_than_inputs to the CTC loss function call in training/deepspeech_training/train.py, but please understand that’s only a workaround.

othiele · July 23, 2020, 9:21am

Adding to @lissyx: look at the error message, it tells you where to change something. There should be two places to change, one for training and one for testing, if I remember correctly.

But find out what is causing the errors, otherwise you’ll come back and ask how to improve your training. That is probably best done by finding the erroneous audio chunks.
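
To illustrate what "causing the errors" usually means here: CTC can only align a transcript to the audio if there are enough time steps for every label, plus a blank between identical neighbouring labels. A minimal sketch of that condition follows; the function names are made up for illustration and are not part of DeepSpeech.

```python
def ctc_min_steps(labels):
    """Smallest number of time steps CTC needs to emit `labels`.

    One step per label, plus one blank step between each pair of
    identical neighbouring labels (for example the "ll" in "hello").
    """
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def fits_ctc(labels, num_steps):
    """True if `labels` can be aligned to `num_steps` time steps."""
    return ctc_min_steps(labels) <= num_steps


# "hello" has 5 labels and one repeated pair, so it needs 6 steps.
print(ctc_min_steps("hello"))
print(fits_ctc("hello", 5))
```

A sample where `fits_ctc` is false is the kind of sample the flag silently skips, which is why it is a workaround and not a fix.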

Matthew_Tan · July 29, 2020, 1:44pm

Thank you all for your very helpful answers!

I was wondering, is there a way to see which audio file triggered this error? I have a database with ~20000 audio files and the error message doesn’t tell me which of them is the erroneous one. Is there a recommended workflow for checking large databases for erroneous chunks?

Thanks!

lissyx · July 29, 2020, 2:09pm

We have code that should share the offending filename, but I can’t remember if it’s working for that specific case.

Worst case, even if a bit painful, a binary search on the CSV content should be quite fast.

Matthew_Tan · July 29, 2020, 2:10pm

Ok! Where can I find the code?

Also, for the binary search, what am I searching for? I sorted from smallest audio files to biggest, but the bad audio chunks could be of any length.

lissyx · July 29, 2020, 2:13pm

Binary search in the sense of:

cut the CSV in half
if you hit the issue, your offending file is in the half that you kept
if you don’t hit the issue, it was in the half you removed
adjust and redo.
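
The halving steps above can be sketched in Python. This is only an outline under assumptions: `fails` is a hypothetical callback that you would write yourself (for example, write the subset to a temporary CSV, run a short training pass, and report whether the error appeared), and the error is assumed to be caused by individual rows and not by combinations of rows.

```python
def find_one_bad_row(rows, fails):
    """Return one row that triggers the error, by repeated halving.

    Assumes fails(rows) is True for the full list passed in.
    """
    lo, hi = 0, len(rows)  # invariant: a bad row is in rows[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(rows[lo:mid]):
            hi = mid       # issue reproduced: keep the first half
        else:
            lo = mid       # no issue: it was in the half we dropped
    return rows[lo]


def find_all_bad_rows(rows, fails):
    """Repeat the search, removing each bad row found, until nothing fails."""
    remaining = list(rows)
    bad = []
    while remaining and fails(remaining):
        row = find_one_bad_row(remaining, fails)
        bad.append(row)
        remaining.remove(row)
    return bad


# Simulated check standing in for a real training run (hypothetical data).
known_bad = {"clip_3.wav", "clip_7.wav"}
rows = [f"clip_{i}.wav" for i in range(10)]
print(find_all_bad_rows(rows, lambda subset: any(r in known_bad for r in subset)))
```

Each bad row costs roughly log2(N) runs, so this gets slow when many files are faulty; in that case a static check on the CSV is likely a better first pass.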

Matthew_Tan · July 29, 2020, 2:13pm

Ah! I see what you’re saying! That makes sense, I’ll try that.

Matthew_Tan · July 29, 2020, 7:57pm

I’ve been doing the binary search and there are MANY files that are erroneous in this dataset. It’s going to take me all day! Is there any way to get the code that will show me which files are faulty? Is there any known shortcut for finding out which files are erroneous without having to listen to all 50000 files?

Thank you!

SanderE · August 1, 2020, 6:56pm

You could look at the checks in https://github.com/mozilla/DeepSpeech/blob/master/bin/import_swc.py under collect_samples(base_dir, language). I’m not sure if they are strict enough, but you could probably filter out a lot of the bad ones with those checks.

hiyassat · September 13, 2021, 6:39pm

Best solution: sort your CSV file and check the audio file sizes. You might find some empty files.
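
A rough sketch of that size check, using only the CSV and no audio decoding. It assumes the usual DeepSpeech CSV columns (wav_filename, wav_filesize, transcript), plain 16 kHz 16-bit mono WAV files with a 44-byte header, and a 20 ms feature step; adjust the constants if your data or flags differ. The function name and the sample rows are made up for illustration.

```python
import csv
import io

WAV_HEADER_BYTES = 44   # assumption: plain PCM WAV header
SAMPLE_RATE = 16000     # assumption: 16 kHz audio
BYTES_PER_SAMPLE = 2    # assumption: 16-bit mono
STEP_MS = 20            # assumption: 20 ms feature step
BYTES_PER_STEP = SAMPLE_RATE * BYTES_PER_SAMPLE * STEP_MS // 1000


def suspicious_rows(csv_file):
    """Yield (wav_filename, reason) for rows that look empty or too short."""
    for row in csv.DictReader(csv_file):
        audio_bytes = int(row["wav_filesize"]) - WAV_HEADER_BYTES
        if audio_bytes <= 0:
            yield row["wav_filename"], "empty audio"
            continue
        est_steps = audio_bytes // BYTES_PER_STEP
        if len(row["transcript"]) > est_steps:
            yield row["wav_filename"], "transcript longer than estimated time steps"


# Hypothetical sample data; in practice pass open("train.csv", newline="").
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,44,hello\n"
    "b.wav,3244,hello world\n"
    "c.wav,64044,hello world\n"
)
for name, reason in suspicious_rows(sample):
    print(name, reason)
```

This only estimates the number of time steps from the file size, so treat the output as a list of candidates to inspect, not as a definitive verdict.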