Inference from checkpoint differs from Output Graph

Hi Team,

Thanks for the awesome model.

I downloaded the pretrained checkpoint from https://github.com/mozilla/DeepSpeech/releases/download/v0.4.1/deepspeech-0.4.1-checkpoint.tar.gz and trained the model on my own data for an additional 3 epochs using the following command:

```
python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir /media/santhosh/Data/SpeechRecognition/Other_languages/Gujarati/gujarati_male_english/checkout/ \
  --epoch -3 \
  --train_files /media/santhosh/Data/SpeechRecognition/Other_languages/Gujarati/gujarati_male_english/train/train.csv \
  --dev_files /media/santhosh/Data/SpeechRecognition/Other_languages/Gujarati/gujarati_male_english/dev/dev.csv \
  --test_files /media/santhosh/Data/SpeechRecognition/Other_languages/Gujarati/gujarati_male_english/test/test.csv \
  --learning_rate 0.0001 \
  --export_dir /media/santhosh/Data/SpeechRecognition/Other_languages/Gujarati/gujarati_male_english/model_export/
```

Model training completed successfully.

When I run inference on an audio file from the checkpoint, using the command:

```
python3 -u DeepSpeech.py \
  --checkpoint_dir /media/santhosh/Data/SpeechRecognition/Other_languages/Gujarati/gujarati_male_english/checkout/ \
  --one_shot_infer /media/santhosh/Data/Arctic_a0023.wav \
  --train 0 --test 0
```

Following is the output:

```
WARNING:root:frame length (1536) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
a combination of canadian capital quickly organized and petitioned for the same privileges
```

This is the correct output.

But when I run inference on the same audio file using the exported output graph, with the command:

```
deepspeech --audio /media/santhosh/Data/Arctic_a0023.wav \
  --alphabet data/alphabet.txt \
  --lm data/lm/lm.binary --trie data/lm/trie \
  --model /media/santhosh/Data/SpeechRecognition/Other_languages/Gujarati/gujarati_male_english/model_export/output_graph.pb
```

Following is the output:

```
Loading model from file /media/santhosh/Data/SpeechRecognition/output_graph.pb
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-02-15 11:17:34.488296: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0837s.
Loading language model from files data/lm/lm.binary data/lm/trie
Loaded language model in 0.269s.
Warning: original sample rate (48000) is different than 16kHz. Resampling might produce erratic speech recognition.
Running inference.
2019-02-15 11:17:35.202634: W tensorflow/core/framework/allocator.cc:122] Allocation of 134217728 exceeds 10% of system memory.
2019-02-15 11:17:35.306563: W tensorflow/core/framework/allocator.cc:122] Allocation of 134217728 exceeds 10% of system memory.
2019-02-15 11:17:35.647207: W tensorflow/core/framework/allocator.cc:122] Allocation of 134217728 exceeds 10% of system memory.
2019-02-15 11:17:35.707844: W tensorflow/core/framework/allocator.cc:122] Allocation of 134217728 exceeds 10% of system memory.
2019-02-15 11:17:35.768958: W tensorflow/core/framework/allocator.cc:122] Allocation of 134217728 exceeds 10% of system memory.
a coming sun a canyon capital quickly unanatomical fie the fame of ages
Inference took 7.522s for 35.268s audio file.
```

Please help me get correct inference from the output graph. Thanks a lot.

Could it be because of a different beam_width parameter?

Thanks a lot for the reply. I suspected the same.
After going through the code, I found that one-shot inference from the checkpoint uses the following default values:

- beam_width: 1024
- lm_alpha: 0.75
- lm_beta: 1.85

So I modified client.py to use the same parameters as above.
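For reference, a minimal sketch of the kind of modification I mean, assuming the v0.4.1 Python API (`Model(model, n_features, n_context, alphabet, beam_width)` and `enableDecoderWithLM(alphabet, lm, trie, lm_alpha, lm_beta)`); the constants mirror the defaults quoted above and the paths are the ones from this thread:

```python
# Sketch of a 0.4.1-style client pinned to the training-time decoder
# defaults quoted above. API signatures assume DeepSpeech 0.4.x.
import wave
import numpy as np
from deepspeech import Model

N_FEATURES = 26    # MFCC features per frame expected by the 0.4.1 graph
N_CONTEXT = 9      # context frames on each side
BEAM_WIDTH = 1024  # training-time default (the stock client ships with a smaller value)
LM_ALPHA = 0.75    # training-time default
LM_BETA = 1.85     # training-time default

MODEL = '/media/santhosh/Data/SpeechRecognition/Trained/output_graph.pb'
ALPHABET = 'data/alphabet.txt'
LM = '/media/santhosh/Data/SpeechRecognition/Trained/lm.binary'
TRIE = '/media/santhosh/Data/SpeechRecognition/Trained/trie'
AUDIO = '/media/santhosh/Data/Arctic_a0023.wav'

ds = Model(MODEL, N_FEATURES, N_CONTEXT, ALPHABET, BEAM_WIDTH)
ds.enableDecoderWithLM(ALPHABET, LM, TRIE, LM_ALPHA, LM_BETA)

# Read the WAV as 16-bit samples, as the stock client does.
with wave.open(AUDIO, 'rb') as fin:
    fs = fin.getframerate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

print(ds.stt(audio, fs))
```

But even with these values, the run produces the same result: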

```
python mozilla_speech.py
Loading model from file /media/santhosh/Data/SpeechRecognition/Trained/output_graph.pb
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-02-15 15:41:37.414439: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0844s.
Loading language model from files /media/santhosh/Data/SpeechRecognition/Trained/lm.binary /media/santhosh/Data/SpeechRecognition/Trained/trie
Loaded language model in 18.5s.
Warning: original sample rate (48000) is different than 16kHz. Resampling might produce erratic speech recognition.
Running inference.
2019-02-15 15:41:56.461374: W tensorflow/core/framework/allocator.cc:122] Allocation of 16777216 exceeds 10% of system memory.
2019-02-15 15:41:56.481665: W tensorflow/core/framework/allocator.cc:122] Allocation of 16777216 exceeds 10% of system memory.
2019-02-15 15:41:56.491669: W tensorflow/core/framework/allocator.cc:122] Allocation of 134217728 exceeds 10% of system memory.
2019-02-15 15:41:56.550008: W tensorflow/core/framework/allocator.cc:122] Allocation of 16777216 exceeds 10% of system memory.
2019-02-15 15:41:56.553335: W tensorflow/core/framework/allocator.cc:122] Allocation of 16777216 exceeds 10% of system memory.
a coming sun a canyon capital quickly unanatomical fie the fame of ages
Inference took 28.265s for 35.268s audio file.
```

(Attachment mozilla_speech.py is missing)

Try converting the sample rate from 48 kHz to 16 kHz; you can use Audacity, lame, or libsox for it. I would be grateful if you'd used preformatted text for your console output :wink:
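For example, a minimal Python sketch of the conversion using scipy's polyphase resampler (standing in for Audacity/lame/libsox; the output path is illustrative):

```python
# Sketch: downsample a 48 kHz WAV to the 16 kHz mono the model expects.
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

SRC = '/media/santhosh/Data/Arctic_a0023.wav'      # 48000 Hz in this thread
DST = '/media/santhosh/Data/Arctic_a0023_16k.wav'  # illustrative output path

fs, audio = wavfile.read(SRC)
if audio.ndim > 1:
    audio = audio.mean(axis=1)                     # mix stereo down to mono
audio = resample_poly(audio.astype(np.float32), up=16000, down=fs)
wavfile.write(DST, 16000, audio.astype(np.int16))
```

Then point the `deepspeech` binary at the 16 kHz file instead.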