DeepSpeech training results not reproducible

I trained a DeepSpeech model twice with the same training set and configuration, but the two runs produced WERs that differed by 3%. I checked flags.py and saw that the random_seed flag is set to a fixed value by default.

Is there something that I’m missing, or are the results not reproducible? How can I ensure that I’m getting reliable results?

Your help would be much appreciated!

This could have many causes. Could you be more specific about:

  • the exact parameters you pass
  • your OS and environment

For example, are you using automatic mixed precision? Did the cuDNN/CUDA minor versions change between runs?
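
One way to pin down the parameters and environment asked about above is to save a small fingerprint alongside each training run and compare the two afterwards. The following is only a sketch using the Python standard library; the function name run_fingerprint and the choice of environment-variable prefixes are assumptions for illustration, not part of DeepSpeech.

```python
import hashlib
import json
import os
import platform
import sys


def run_fingerprint(argv, env=None, prefixes=("TF_", "CUDA_", "NVIDIA_")):
    """Collect the command line, OS details, and selected env vars for one run."""
    env = os.environ if env is None else env
    info = {
        "argv": list(argv),
        "python": platform.python_version(),
        "os": platform.platform(),
        # Keep only variables whose names start with one of the prefixes.
        "env": {k: v for k, v in sorted(env.items()) if k.startswith(prefixes)},
    }
    # Hash a canonical JSON dump so two runs can be compared at a glance.
    canonical = json.dumps(info, sort_keys=True)
    info["digest"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return info


if __name__ == "__main__":
    # Print the fingerprint of the current process as JSON.
    print(json.dumps(run_fingerprint(sys.argv), indent=2))
```

If the digests of two runs differ, the JSON shows which flag or variable changed. This sketch does not record CUDA, cuDNN, or Python package versions; those would have to be added separately.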

Maybe try longer training runs (bigger dataset, more epochs). Nondeterministic effects should matter less on longer runs.
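
If some run-to-run variation remains, one option is to repeat the training a few times and report the mean and spread of the WER instead of a single number. A minimal sketch, assuming you collect the WER of each run by hand; summarize_wer is a hypothetical helper and the numbers are placeholders, not measured results.

```python
import statistics


def summarize_wer(wers):
    """Return mean, sample standard deviation, and max-min spread of WERs."""
    if len(wers) < 2:
        raise ValueError("need at least two runs to estimate variation")
    return {
        "mean": statistics.mean(wers),
        "stdev": statistics.stdev(wers),
        "spread": max(wers) - min(wers),
    }


# Placeholder values for illustration only, not measured results.
example_wers = [0.30, 0.33, 0.31]
summary = summarize_wer(example_wers)
print(summary)
```

A difference between two configurations is only meaningful if it is clearly larger than the spread seen between identical runs.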