Exported graph giving higher WER than checkpoints

I have around 100,000 audio files. Inference via checkpoints gives accurate results on all of them, but inference with the exported graph (output_graph.pbmm) fails to give correct predictions on 87 of the 100,000 recordings.

I know it’s a very small percentage, but under what circumstances can the exported graph fail when the checkpoints give accurate results?

Moreover, the 87 errors from the exported graph follow a consistent pattern: only the last predicted word is incorrect, while the rest of the prediction is accurate. Some samples for reference:

{
  "wav_filename": "audio1.wav",
  "src": "any progress sir",
  "res": "any progress or"
},
{
  "wav_filename": "audio3.wav",
  "src": "i thought i recognized your car",
  "res": "i thought i recognized your"
},
{
  "wav_filename": "audio4.wav",
  "src": "so i can take care of myself and my baby and him",
  "res": "so i can take care of myself and my baby and i"
},

The samples above are the predictions from the output_graph.pbmm graph (src is the ground truth and res is the model prediction). When I perform inference using checkpoints (with the --one_shot_infer flag) on the samples given above, I get accurate predictions.

Some additional information:

  • DeepSpeech version: 0.8.2
  • beam_width: 1024
  • export_beam_width: 1024

Can you repro with 0.9.2?

Have you verified those values?

Are you using the same LM? Could your files be broken or misread by the inference tools?
Where is the code you use to reproduce?

No, I have only tested with 0.8.2.

Yes, I have verified the values.

I am using the same LM in both cases (and even if the LM were different, I think the errors would show up in more than just 83 out of 100,000 files), and I have checked the files as well.

For inference with checkpoints, I am using the evaluate.py file from the training module in the DeepSpeech code.
For inference with the exported graph, I am using the deepspeech-gpu==0.8.2 package. Code:

import deepspeech
from scipy.io import wavfile

# Load model and external scorer
model = deepspeech.Model('output_graph.pbmm')
model.enableExternalScorer('kenlm.scorer')

# Load audio (16 kHz, 16-bit signed PCM, mono)
_, audio = wavfile.read('audio_path.wav')

# Prediction
print(model.stt(audio))
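
As an extra sanity check on the inference side, something like the sketch below confirms that each file matches what the model expects before transcribing it. It uses the sampleRate() accessor from the deepspeech package; the file names are the same placeholders as above.

import deepspeech
import numpy as np
from scipy.io import wavfile

model = deepspeech.Model('output_graph.pbmm')
model.enableExternalScorer('kenlm.scorer')

rate, audio = wavfile.read('audio_path.wav')

# The native client does not resample, so a mismatched rate or a
# non-int16 buffer quietly hurts accuracy instead of raising an error.
assert rate == model.sampleRate(), f"expected {model.sampleRate()} Hz, got {rate}"
assert audio.dtype == np.int16, f"expected int16 samples, got {audio.dtype}"
assert audio.ndim == 1, "expected mono audio"

print(model.stt(audio))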

I’m skeptical; 83 out of 100,000 rather looks like something specific to those files.

Please check them: origin, headers, data.

Yeah, I checked. Here is a sample output of soxi from one of the files.

$ soxi audio.wav

Input File     : 'audio.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.52 = 72320 samples ~ 339 CDDA sectors
File Size      : 145k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

The audio file that gives the output above is attached:
audio.zip (112.5 KB)
Ground Truth: “so i can take care of myself and my baby and him”
Prediction (pbmm): “so i can take care of myself and my baby and i”
Prediction (checkpoints): “so i can take care of myself and my baby and him”

All the files have the same sample rate, precision, channel count, bit rate, and sample encoding.
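
A quick way to confirm that across the whole set is a small script over the headers; a rough sketch using only the standard-library wave module (wav_paths is a placeholder for the actual list of file paths):

import wave

expected = (1, 2, 16000)  # channels, sample width in bytes, sample rate

for path in wav_paths:
    with wave.open(path, 'rb') as w:
        params = (w.getnchannels(), w.getsampwidth(), w.getframerate())
        if params != expected:
            print(path, params)
        # A zero or negative frame count is a common sign of a truncated file.
        if w.getnframes() <= 0:
            print(path, 'empty or truncated frame count')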

I do get your point that it is highly probable that those 83 files have errors, but if the files had errors, why would the mistake appear only in the last word?

Now, out of those 83, there were some files in which the last word was only half spoken, but the question still remains: how are the checkpoints able to make accurate predictions where the pbmm graph fails?

I don’t know. I’m not supposed to be working these days, I don’t have your data, and I don’t have time to investigate for you.

So we’re making progress.

There are many differences in how audio is read between the checkpoint path and the inference library: in the past (circa TensorFlow r1.14, so not that long ago), broken files would completely break the training loop.
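
One way to chase that down without the full training pipeline is to decode the suspect files with two independent readers and compare the raw samples; files where the readers disagree (extra header chunks, trailing garbage, truncated data) are the first suspects. A rough sketch with scipy and the standard-library wave module (bad_paths is a placeholder for the 83 file paths):

import wave
import numpy as np
from scipy.io import wavfile

def samples_disagree(path):
    """Return True if scipy and the wave module decode the file differently."""
    try:
        scipy_rate, scipy_audio = wavfile.read(path)
        with wave.open(path, 'rb') as w:
            wave_rate = w.getframerate()
            wave_audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    except Exception:
        # A file that one reader refuses to parse is just as suspicious.
        return True
    return scipy_rate != wave_rate or not np.array_equal(scipy_audio, wave_audio)

for path in bad_paths:
    if samples_disagree(path):
        print(path)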