Difference between Test Epoch WER/CER and Inference WER/CER with DeepSpeech 0.7.4

Hello,

I am noticing differences in predicted transcripts at test time vs using the inference utilities with the same model and scorer.

For example, at the end of the test epoch I get the report of Test WER and CER; it also displays the best, median, and worst WER along with some example files.

WER: 1.500000, CER: 1.500000, loss: 564.514343
 - wav: file:////tmp/external/atc_data/020918LIVE_NonStop_KDAB_CLEARANCE_GROUND_Daytona_Beach_FL_Tower_Communication.mp3_chunk003.mp3_split_files/audio_segment_615.wav
 - src: "riddle four forty four maintain at or below three thousand"
 - res: "one four daytona clearance maintain vfr at or below three thousand departure frequency one two five point squack four two four one"

Now, running the client.py tool with the model.pbmm produced by the training above and the same scorer that was passed to the --scorer_path training argument, and leaving beam_width, lm_alpha, and lm_beta at their defaults from the model and scorer respectively, I receive different results:

Loading model from file ..\data\external\model_storage\atc_model_8_10_2020_30epochs_v9.pbmm
TensorFlow: v1.15.0-24-gceb46aae58
DeepSpeech: v0.7.4-0-gfcd9563f
2020-08-11 12:10:06.577643: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Loaded model in 0.0574s.
Loading scorer from files ..\data\external\scorer_storage\new_kenlm_atc_order6_topk1500.scorer
Loaded scorer in 0.0116s.
Running inference.
in we four daytona clearance maintain vfr at or below three thousand departure frequency one two five point it squack four two four one
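
For completeness, here is roughly how the decoder settings could be pinned explicitly through the deepspeech Python package instead of relying on the defaults (a minimal sketch; the paths and the numeric values are placeholders, not my actual settings):

```python
import wave

import numpy as np
from deepspeech import Model

# Placeholder paths -- substitute the real model, scorer, and audio file.
MODEL_PATH = "atc_model.pbmm"
SCORER_PATH = "new_kenlm_atc_order6_topk1500.scorer"
WAV_PATH = "audio_segment_615.wav"

ds = Model(MODEL_PATH)
ds.enableExternalScorer(SCORER_PATH)

# Pin the decoder settings to the exact values used during the test epoch
# (--beam_width, --lm_alpha, --lm_beta); the numbers below are placeholders.
ds.setBeamWidth(1024)
ds.setScorerAlphaBeta(0.93, 1.18)

# DeepSpeech expects 16 kHz, 16-bit, mono PCM audio as an int16 buffer.
with wave.open(WAV_PATH, "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), np.int16)

print(ds.stt(audio))
```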

Why is the result from inference different from what is produced at test time? Could the inference tool and the end-of-test results diverge because of hyperparameters such as beam_width? I am using the inference tool as my final assessment of each model I train (I have an edited client version that reports WER and CER as it infers), but I want to understand these differences so I can ultimately assess my model with the true WER and CER.
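
For context, the kind of WER/CER computation I mean is roughly this (a minimal sketch using a plain Levenshtein distance, not my actual edited client; the src/res pair from the test report above is just used as an example):

```python
def _edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]


def wer(reference, hypothesis):
    ref_words = reference.split()
    return _edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    return _edit_distance(list(reference), list(hypothesis)) / len(reference)


src = "riddle four forty four maintain at or below three thousand"
res = ("one four daytona clearance maintain vfr at or below three thousand "
       "departure frequency one two five point squack four two four one")
print(wer(src, res), cer(src, res))
```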

Thanks!

In the example, it seems only the first words are affected? Is that right?

Yeah, in this example. After looking at the source file, I believe the test-time result is more accurate (I think the transcriber made an error on this particular file, so it's maybe not the best example).

Regardless of the error in the transcription, is there a reason for the different predictions at test time and at inference?

No, except that we have had some random people complaining about the first word being wrong in some cases, but nobody cared enough to share anything more actionable.

Some people reported that hacking the library and adding some white noise for a few dozen to a few hundred ms before the actual audio was helping.

I actually append 100ms of silence to the beginning and end of all of my files.
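
Something along these lines (a minimal sketch assuming 16-bit PCM WAV files; pad_with_silence and the file names are just illustrative):

```python
import wave

import numpy as np


def pad_with_silence(in_path, out_path, pad_ms=100):
    """Prepend and append pad_ms of silence to a 16-bit PCM WAV file."""
    with wave.open(in_path, "rb") as wav:
        params = wav.getparams()
        audio = np.frombuffer(wav.readframes(wav.getnframes()), np.int16)

    # pad_ms worth of zero samples at the file's own sample rate.
    pad = np.zeros(int(params.framerate * pad_ms / 1000) * params.nchannels,
                   dtype=np.int16)
    padded = np.concatenate([pad, audio, pad])

    with wave.open(out_path, "wb") as out:
        out.setparams(params)  # nframes is corrected when the file is closed
        out.writeframes(padded.tobytes())


pad_with_silence("audio_segment_615.wav", "audio_segment_615_padded.wav")
```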

I’m not as worried about the first word being wrong right now. This is just a single example. It is able to predict other transcripts just fine.

I am just trying to understand why client.py is giving me different inference than what is displayed at test time.

Fine-tuning is actually doing quite well on my data. I only have about 3.5 hours of training data in this domain at the moment, yet results are still around 20-30% WER/CER after training for 30 epochs and using my own scorer with varying hyperparameters, and it can produce some lengthy sentences in a very ugly domain (air traffic control), so I am quite pleased with DeepSpeech at the moment.

Looking back, it seems like it happens at the end of this transcript too.

From test epoch:
"one four daytona clearance maintain vfr at or below three thousand departure frequency one two five point squack four two four one"

From client.py:
"in we four daytona clearance maintain vfr at or below three thousand departure frequency one two five point it squack four two four one"

Looks like it inserts the word "it" before "squack" in the client version too.