DeepSpeech 0.4.1 Returning Blank Inferences

(20richardh) #1
  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository) : No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04) : Ubuntu Virtual Machine (I use a Windows computer)
  • TensorFlow installed from (our builds, or upstream TensorFlow) : pip3 install tensorflow==1.12.0
  • TensorFlow version (use command below) : ‘v1.13.1-0-g6612da8951’ 1.12.0
  • Python version : 3.6
  • Bazel version (if compiling from source) : N/A
  • GCC/Compiler version (if compiling from source) : N/A
  • CUDA/cuDNN version : N/A
  • GPU model and memory : N/A
  • Exact command to reproduce :
    deepspeech --model /mnt/d/allModels/UA/F04/F04_incomplete_output_graph.pb --alphabet models/alphabet.txt --audio /mnt/d/UA_Data/implementation/F04/train/F04/a/a_0.wav --lm models/lm.binary --trie models/trie

Loading model from file /mnt/d/allModels/UA/F04/F04_incomplete_output_graph.pb
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-05-17 15:15:20.720751: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.242s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 0.22s.
Running inference.

(the inference is a blank line)

Note that I’m even testing on the training set, so the accuracy should be higher than usual. I ran quite a few examples and kept getting blank responses. Is it because I need more data or more training iterations?

(20richardh) #2

Also, the training set is quite small at 8 files per word for 455 words. But they’re all from one subject, and even the training set returns blanks.

(Hafsa Farooq) #3

I am facing the same issue.

(Carlos Fonseca) #4

Are you using 16-bit, 16 kHz, mono audio files?

(Hafsa Farooq) #5

What do you mean by 16-bit? And yes, the rest of the audio properties match.

(20richardh) #6

Yes, I converted it to that format using:

ffmpeg -i <input.wav> -ar 16000 -ac 1 -c:a pcm_s16le <output.wav>

Does that sound right?
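If it helps anyone checking their converted files, Python’s standard `wave` module can confirm all three properties DeepSpeech expects (16-bit PCM, 16 kHz, mono). This is just a sketch; the file name is a placeholder, and the demo writes a short silent WAV purely to have something to inspect:

```python
import wave

def check_wav(path):
    """Return True if the file is 16-bit, 16 kHz, mono PCM."""
    with wave.open(path, "rb") as w:
        return (w.getsampwidth() == 2 and      # 2 bytes per sample = 16-bit
                w.getframerate() == 16000 and  # 16 kHz sample rate
                w.getnchannels() == 1)         # mono

# Demo: write one second of silence in the expected format, then verify it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(check_wav("demo.wav"))  # True
```

Running this against the ffmpeg output instead of `demo.wav` would tell you whether the conversion actually produced what the engine expects.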

(Lissyx) #7

Can you share your exact training parameters?

(20richardh) #8

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /mnt/d/Mozilla/trainings/F04_checkpoints --epoch 1 --validation_step 1 --train_files /mnt/d/UA_Data/implementation/F04/train.csv --dev_files /mnt/d/UA_Data/implementation/F04/validate.csv --test_files /mnt/d/UA_data/implementation/F04/validate2.csv --learning_rate 0.0001 --alphabet_config_path /mnt/d/richardhu/tensorflow-master/tensorflow/examples/GitLFS_forDeepSpeech/scratch/DeepSpeech-041/data/alphabet.txt --export_dir /mnt/d/allModels/UA/F04

In summary: one epoch (is it proper to train on each datum more than once?) and a learning rate of 0.0001.

Other info: each audio file is a single word, between 1 and 5 seconds long. The training set has 2250 files total (about 5–10 per word) covering 455 different words.
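As a rough sanity check on training volume: one epoch over 2250 files amounts to very few weight updates for a 2048-unit model, which by itself could explain blank output. A quick back-of-envelope (the batch size of 24 is an assumption for illustration; the actual value depends on the DeepSpeech version and any `--train_batch_size` flag passed):

```python
import math

# Numbers from the thread: 2250 training files, 1 epoch.
# batch_size is a hypothetical value, not a confirmed default.
files = 2250
epochs = 1
batch_size = 24

# Total gradient steps seen during training.
steps = epochs * math.ceil(files / batch_size)
print(steps)  # 94
```

On the order of a hundred updates is far below what speech models typically need before the CTC decoder emits anything at all, so more epochs (or starting from a pre-trained checkpoint) is worth trying.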

(kdavis) #9

What language is your training data in?

(20richardh) #10

It’s in English, but heavily accented (I categorize accents by general demographics).

Perhaps I could do transfer learning off the pre-trained checkpoint you offer on normal speech?