Hello,
I am noticing differences between the transcripts predicted at test time and those produced by the inference utilities, using the same model and scorer.
For example, at the end of testing I get a report of the test WER and CER. It also displays the best, median, and worst WER, along with some example files:
WER: 1.500000, CER: 1.500000, loss: 564.514343
- wav: file:////tmp/external/atc_data/020918LIVE_NonStop_KDAB_CLEARANCE_GROUND_Daytona_Beach_FL_Tower_Communication.mp3_chunk003.mp3_split_files/audio_segment_615.wav
- src: "riddle four forty four maintain at or below three thousand"
- res: "one four daytona clearance maintain vfr at or below three thousand departure frequency one two five point squack four two four one"
Now I run the client.py tool with the model.pbmm that was produced by the above training run and the same scorer that was passed in the --scorer_path training argument. Leaving beam_width, lm_alpha, and lm_beta at their defaults from the model and scorer respectively, I receive a different result:
Loading model from file ..\data\external\model_storage\atc_model_8_10_2020_30epochs_v9.pbmm
TensorFlow: v1.15.0-24-gceb46aae58
DeepSpeech: v0.7.4-0-gfcd9563f
2020-08-11 12:10:06.577643: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Loaded model in 0.0574s.
Loading scorer from files ..\data\external\scorer_storage\new_kenlm_atc_order6_topk1500.scorer
Loaded scorer in 0.0116s.
Running inference.
in we four daytona clearance maintain vfr at or below three thousand departure frequency one two five point it squack four two four one
Why is the result from inference different from what is produced at test time? Is the variation between the inference tool's output and the results at the end of testing due to the beam_width hyperparameter, etc.? I treat the inference tool as my final assessment of each model I train (I have an edited client version that reports WER and CER as it infers), but I want to understand these differences so I can ultimately assess my model with the true WER and CER.
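To make the comparison concrete, here is a minimal sketch of how I score both transcripts. It assumes WER is the word-level Levenshtein distance divided by the number of reference words, and CER is the same at character level. The helper names (edit_distance, wer, cer) are my own for illustration and are not part of the DeepSpeech API, so the test report may compute these slightly differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edits / number of reference words (assumed definition)."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / len(ref_words)


def cer(src, res):
    """Character error rate: character-level edits / number of reference characters (assumed definition)."""
    return edit_distance(src, res) / len(src)


# Transcripts copied from the test report and the client.py run above.
src = "riddle four forty four maintain at or below three thousand"
test_res = ("one four daytona clearance maintain vfr at or below three thousand "
            "departure frequency one two five point squack four two four one")
client_res = ("in we four daytona clearance maintain vfr at or below three thousand "
              "departure frequency one two five point it squack four two four one")

print("test   WER:", wer(src, test_res), "CER:", cer(src, test_res))
print("client WER:", wer(src, client_res), "CER:", cer(src, client_res))
```

Under these assumptions the test-time transcript scores a WER of 1.5, matching the report, while the client.py transcript scores 1.7 because of the two extra inserted words.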
Thanks!