Training without language model

I’m training on the LibriSpeech corpus, and I tried replacing the custom LM decoder with tf.nn.ctc_greedy_decoder and with tf.nn.ctc_beam_search_decoder. In both cases the model does not seem to learn: I see no improvement in loss or edit_distance after several hundred batches (batch_size=32). I also tried reducing the learning rate to 0.00001, which didn’t help. All other settings are the defaults in run-librivox.sh. A minimal sketch of how I swapped the decoder is below.
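This is not the exact DeepSpeech code, just an illustration of the swap; logits, seq_lengths and sparse_labels are placeholder names assumed to stand in for the corresponding tensors in the model:

```python
import tensorflow as tf

# Assumed tensors from the model (names are hypothetical):
#   logits:        [max_time, batch_size, num_classes], time-major as CTC expects
#   seq_lengths:   [batch_size] int32 frame counts per utterance
#   sparse_labels: tf.SparseTensor (int32) of target character indices

def decode_without_lm(logits, seq_lengths, use_beam_search=True):
    """Decode CTC output with the stock TF decoders instead of decode_with_lm."""
    if use_beam_search:
        decoded, _ = tf.nn.ctc_beam_search_decoder(
            logits, seq_lengths, beam_width=100, top_paths=1)
    else:
        decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_lengths)
    return decoded[0]  # best path as a SparseTensor of int64 label indices

# Mean normalized edit distance against the labels (what I log as CER):
# decoded = decode_without_lm(logits, seq_lengths)
# distance = tf.reduce_mean(
#     tf.edit_distance(tf.cast(decoded, tf.int32), sparse_labels))
```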

The decode_with_lm decoder seems to work fine (steady improvement from the start).

Has anyone got it to work without the LM, and if so, did you change anything else to make it work?

Actually, it started improving after ~2k batches, but it is still very bad even after 5k:

2018-07-31 12:19:31 train 5555 | loss 66.7 | CER 0.655 | WER 1.061
“chapter seven parry” vs “go ssn ca”

2018-07-31 12:19:37 train 5560 | loss 76.1 | CER 0.676 | WER 1.013
“oh yes said the dying girl” vs “o y sta the tii go”

2018-07-31 12:19:44 train 5565 | loss 72.2 | CER 0.653 | WER 1.032
“brutus why” vs “oess ”

2018-07-31 12:19:51 train 5570 | loss 76.9 | CER 0.619 | WER 0.988
“yet in the broad light of the forenoon” vs “s in the do lla o the fforr me”

2018-07-31 12:19:58 train 5575 | loss 73.5 | CER 0.623 | WER 1.035
“sylvie sylvie” vs “sse ssoley”

2018-07-31 12:20:05 train 5580 | loss 78.6 | CER 0.649 | WER 1.058
“he exclaimed now is the time” vs “h ssin nnss an tii”

2018-07-31 12:20:14 train 5585 | loss 78.9 | CER 0.635 | WER 1.042
“a savage finds in a wreck on the coast” vs “a sstee innes and a rreann ecccosss”

For comparison, other implementations I tried (fordDSP, yao-matrix, zzw992cn) produce better results than this even after 300 batches. For example, here’s what the yao-matrix code produces after 3k batches:

THEY WERE RUN OUT OF THEIR VILLAGE vs TEY WERE ON OUT OF THEIR VILLAGE
THE WHOLE THING WAS A TRIFLE ODD vs TE OTHING WAS ATRIFLOND
I HAD NO ILLUSIONS vs I HAD NO OLUSIONS
HE CHECKED THE SILLY IMPULSE vs HE CHETHE SILY IM PULSE
SO HE’S A FRIEND OF YOURS EH vs SO HE’S A FREND OF YOURS ANY
A MAN IN THE WELL vs A MAN IN TE WELL
I COULD NOT HELP MY FRIEND vs I COULD NOT HELD MY FRIEND

Generally one has to tune a system. Simply changing the decoder, data set, and hyperparameters, then expecting optimal behaviour is, unfortunately, unrealistic for deep learning systems. (The system you reference, yao-matrix, has already been tuned for LibriSpeech and has a different architecture from this one.)

@kdavis, well, yes, that’s why I posted this question. What would you change first to tune the model so that it works better with a simple greedy decoder?

Also, I’m a bit confused: are the parameters in run-librivox.sh not optimal for LibriSpeech? And what would be a good strategy if I wanted to train the model on multiple datasets (e.g. combining CV, LS, TED-LIUM, and VoxForge)?