Worse evaluation results with evaluate_tflite than evaluate

I wanted to check how fast evaluate_tflite is, and it turned out to be a couple of orders of magnitude slower than evaluate. But what surprised me the most was the much worse inference quality: with evaluate_tflite I got 40% WER, while with evaluate I got 20% WER. Is this a known issue?

I don’t see any place in the evaluate_tflite script where I could specify the model, though; the model variable is not used in tflite_worker.

So I assume the beam width is perhaps much smaller than what I use with evaluate.py? (There I have the default beam_width of 1024.)
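For reference, a minimal sketch of how the beam width could be set explicitly through the DeepSpeech Python bindings; the model path here is just a placeholder:

```python
from deepspeech import Model

# Placeholder path to an exported TFLite graph.
ds = Model('output_graph.tflite')

# Match the beam width used with evaluate.py (default 1024 there) so the
# CTC decoder explores the same number of hypotheses in both evaluations.
ds.setBeamWidth(1024)
```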

? The model is specified here, via args.model: DeepSpeech/evaluate_tflite.py at master · mozilla/DeepSpeech · GitHub

Please be specific about what you tested; evaluate_tflite uses the Python bindings and spawns multiple processes. If you are comparing it to GPU-backed, big-batch evaluation, it’s not surprising that it is slower.
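Roughly, each worker process decodes utterances one at a time through the bindings, which is inherently slower than batched GPU inference. The sketch below is simplified and the names are illustrative, not the exact code in evaluate_tflite.py:

```python
import wave
import numpy as np
from deepspeech import Model

def worker_decode(model_path, wav_paths):
    """Sketch of a per-process worker: load one Model, decode files one by one."""
    ds = Model(model_path)
    results = []
    for wav_path in wav_paths:
        with wave.open(wav_path, 'rb') as fin:
            # 16-bit PCM samples, as expected by the bindings.
            audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
        # Single-utterance inference: no batching across files.
        results.append(ds.stt(audio))
    return results
```

Several such workers are spawned with multiprocessing, but each call still processes a single utterance at a time.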

Again, without more details:

  • I can’t tell if it’s expected in your case
  • our testing showed no meaningful differences

Thank you!

I have a German test set of roughly 30 hours, which is used for both evaluate.py and evaluate_tflite.py.

Sorry, my bad. I meant the scorer argument.

args.scorer is the next one …

And are you sure you are running the exact same comparison?

Yes, it is the exact same comparison. But the scorer is not used in the tflite_worker function.

I’ll double-check it, because if your tests indicate similar performance, it must be a mistake on my side.

Sorry, but your message was very unclear. That looks like a bug you could send a fix for; it’s easy to fix: we are missing a call to enable the external scorer.

Can you file a bug at least, and make a PR if you can? Since you are working on that, you can verify whether it works or not.

If we are missing the scorer, it may very well explain your discrepancy.
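For context, a minimal sketch of what the missing call might look like in the worker’s setup, assuming the scorer path is passed down; this is simplified, not the actual tflite_worker body:

```python
from deepspeech import Model

def setup_worker_model(model_path, scorer_path=None):
    """Sketch: create the Model and, crucially, enable the external scorer."""
    ds = Model(model_path)
    if scorer_path:
        # The call that appears to be missing: without it the decoder runs
        # without the external language model, which explains a higher WER.
        ds.enableExternalScorer(scorer_path)
        # Optionally, the LM weights can be tuned as well:
        # ds.setScorerAlphaBeta(lm_alpha, lm_beta)
    return ds
```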

Great, I’ll come back with a PR.