DeepSpeech 0.4.1 Returning Blank Inferences

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository) : No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04) : Ubuntu Virtual Machine (I use a Windows computer)
  • TensorFlow installed from (our builds, or upstream TensorFlow) : pip3 install tensorflow==1.12.0
  • TensorFlow version (use command below) : ‘v1.13.1-0-g6612da8951’ 1.12.0
  • Python version : 3.6
  • Bazel version (if compiling from source) : N/A
  • GCC/Compiler version (if compiling from source) : N/A
  • CUDA/cuDNN version : N/A
  • GPU model and memory : N/A
  • Exact command to reproduce :
    deepspeech --model /mnt/d/allModels/UA/F04/F04_incomplete_output_graph.pb --alphabet models/alphabet.txt --audio /mnt/d/UA_Data/implementation/F04/train/F04/a/a_0.wav --lm models/lm.binary --trie models/trie

Loading model from file /mnt/d/allModels/UA/F04/F04_incomplete_output_graph.pb
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-05-17 15:15:20.720751: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.242s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 0.22s.
Running inference.

(the inference is a blank line)

Note that I’m even trying this on the training set, so the accuracy should be higher than usual. I ran quite a few examples and kept getting blank responses. Is it because I need more data/training iterations?

Also, the training set is quite small at 8 files per word for 455 words. But they’re all from one subject, and even the training set returns blanks.

I am facing the same issue.

Are you using 16-bit, 16 kHz, mono audio files?

What do you mean by 16-bit? And yes, the rest of the audio properties match.

Yes, I converted it to that format using:

ffmpeg -i <input.wav> -ar 16000 -ac 1 -c:a pcm_s16le <output.wav>

Does that sound right?
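Before re-running inference, it can help to confirm the converted files actually came out as 16 kHz / mono / 16-bit PCM. Here's a small sketch using Python's standard-library `wave` module; the helper name `check_wav` is mine, not part of DeepSpeech:

```python
import wave

def check_wav(path):
    """Report whether a WAV file matches DeepSpeech's expected input:
    16 kHz sample rate, mono, 16-bit samples (2 bytes per sample)."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        width = w.getsampwidth()
        return {
            "rate": rate,
            "channels": channels,
            "sample_width_bytes": width,
            "ok": rate == 16000 and channels == 1 and width == 2,
        }
```

Running it on one of the converted files and checking that `"ok"` is `True` rules out audio format as the cause of blank output.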

Can you share your exact training parameters?

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /mnt/d/Mozilla/trainings/F04_checkpoints --epoch 1 --validation_step 1 --train_files /mnt/d/UA_Data/implementation/F04/train.csv --dev_files /mnt/d/UA_Data/implementation/F04/validate.csv --test_files /mnt/d/UA_data/implementation/F04/validate2.csv --learning_rate 0.0001 --alphabet_config_path /mnt/d/richardhu/tensorflow-master/tensorflow/examples/GitLFS_forDeepSpeech/scratch/DeepSpeech-041/data/alphabet.txt --export_dir /mnt/d/allModels/UA/F04

In summary: 1 epoch (is it proper to train on each datum more than once?), learning rate = 0.0001.

Other info: each audio file is a single word, between 1 and 5 seconds long. The training set has 2250 files total (about 5-10 per word) covering 455 different words.
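One cheap sanity check with a custom word list: make sure every character in the training transcripts actually appears in the alphabet.txt being used, since characters the alphabet can't represent are one plausible cause of degenerate (blank or single-letter) output. This checker is my own sketch, not a DeepSpeech tool; it assumes the 0.4.1 alphabet format of one character per line, with `#` starting comment lines:

```python
def find_unsupported_chars(transcripts, alphabet_path):
    """Return the set of characters that appear in the given transcripts
    but are missing from alphabet.txt (one character per line, lines
    starting with '#' treated as comments)."""
    with open(alphabet_path, encoding="utf-8") as f:
        # rstrip only the newline so the single-space "word separator"
        # line in alphabet.txt is preserved as a valid character
        allowed = {line.rstrip("\n") for line in f if not line.startswith("#")}
    unsupported = set()
    for text in transcripts:
        for ch in text.lower():
            if ch not in allowed:
                unsupported.add(ch)
    return unsupported
```

An empty result means the transcripts are at least representable; a non-empty result points at transcripts that the CTC output layer can never produce correctly.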

What language is your training data in?

It’s in English, but heavily accented (I categorize accents by general demographics).

Perhaps I could do transfer learning off the pre-trained checkpoint you offer on normal speech?

Update: I removed the lm.binary and trie arguments, and I’m getting better results; I guess those pre-built files didn’t really work with my vocabulary set. Marking this “Fixed” for now.

@20richardh Did you do your own training, or are you just using the pre-trained model?

I did my own training, but I didn’t generate my own lm.binary/trie because I thought the defaults would work.

I ran the training on all 30,000 files, and I still only get the letter “i” or “a” for everything using the following commands:

(pre-built) deepspeech --model models/output_graph.pb --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio /mnt/d/Google_speech_dataset/backward/0b7ee1a0_nohash_4.wav

(no trie, no lm.binary) deepspeech --model /mnt/d/allModels/UA/tester/tester_output_graph.pb --alphabet models/alphabet.txt --audio /mnt/d/UA_Data/implementation/tester/train/tester/a/a_0.wav

(no trie, with binary) deepspeech --model /mnt/d/allModels/UA/tester/tester_output_graph.pb --alphabet models/alphabet.txt --audio /mnt/d/UA_Data/implementation/tester/train/tester/a/a_0.wav --lm models/lm.binary

Does this mean I need to generate my own trie/lm.binary? (I’d also expect these to be generated as part of training, but maybe that’s not viable.)

In my experience, you should generate your own trie/lm.binary if you have modified alphabet.txt: the model’s outputs index lines of the alphabet file, so a pre-generated trie/lm.binary will no longer map to the correct characters.

I didn’t edit the alphabet file, however. I also tried generating my own trie, but I’m having trouble building the lm.binary. I guess I’ll keep trying.
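For a custom vocabulary like this, the usual first step toward a custom lm.binary/trie is a plain text file of your transcripts, one utterance per line, which is what KenLM’s `lmplz` consumes. A minimal sketch that extracts it from DeepSpeech-style training CSVs (the function name and deduplication choice are mine; check the DeepSpeech 0.4.1 docs for the exact KenLM and `generate_trie` invocations for that release):

```python
import csv

def build_vocabulary(csv_paths, out_path):
    """Collect unique, lowercased transcripts from DeepSpeech training
    CSVs (columns: wav_filename, wav_filesize, transcript) into a text
    file with one utterance per line. Returns the number of lines written.

    Next steps happen outside this script (invocations vary by release,
    so verify them against the DeepSpeech 0.4.1 documentation):
      1. build an ARPA model with KenLM's lmplz,
      2. convert it with KenLM's build_binary to get lm.binary,
      3. run the native_client generate_trie tool to get the trie."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in csv_paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    text = row["transcript"].strip().lower()
                    if text and text not in seen:
                        seen.add(text)
                        out.write(text + "\n")
    return len(seen)
```

With single-word utterances this produces a small, domain-specific corpus, which should match a 455-word vocabulary far better than the general-English lm.binary shipped with the release.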