Inference on Self-Trained Model produces gibberish as output


#1

Hello everyone,

For my diploma thesis I am currently trying to train a model for German audio.
I have already created an LM from a large vocabulary corpus (~8 million German sentences) and used various clean audio sources (the Spoken Wikipedia Corpus among others) to train the model.
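For reference, the LM was built roughly along these lines with KenLM (a sketch only; paths are placeholders and the exact generate_trie arguments depend on the DeepSpeech checkout, so double-check them against your native_client build):

  # Build a 5-gram ARPA model from the cleaned German sentences (one per line)
  lmplz -o 5 < german_sentences.txt > lm.arpa
  # Convert the ARPA file into KenLM's binary format
  build_binary lm.arpa lm.binary
  # Build the trie against the same alphabet file used for training
  generate_trie alphabet.txt lm.binary trie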

While the training finished without problems, inference produces complete gibberish as output. Even with the same audio files I used for training, it doesn’t predict anything meaningful. When the test set was used after training to calculate the WER, the model recognized at least some words.

I use DeepSpeech 0.4.1 for both training and inference, and the audio files are mono WAV at 16 kHz.
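In case the audio preprocessing matters: the files were converted roughly like this with sox (ffmpeg would work just as well):

  # Resample to 16 kHz, downmix to mono, 16-bit PCM WAV
  sox input.wav -r 16000 -c 1 -b 16 output.wav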

This is the inference output:

(YouTubeDownloads_python3) ➜ native_client git:(master) ✗ deepspeech --model …/lm-ynop-stock/models/model_export_clean/output_graph.pb --alphabet …/lm-ynop-stock/alphabet.txt --audio …/audio.wav
Loading model from file …/lm-ynop-stock/models/model_export_clean/output_graph.pb
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
Loaded model in 0.0316s.
Running inference.
2019-01-15 00:22:36.216963: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:855] BlockLSTMOp is inefficient when both batch_size and cell_size are odd. You are using: batch_size=1, cell_size=375
[…repeats very often…]
2019-01-15 00:22:36.228672: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:850] BlockLSTMOp is inefficient when both batch_size and input_size are odd. You are using: batch_size=1, input_size=375
2019-01-15 00:22:36.228740: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:855] BlockLSTMOp is inefficient when both batch_size and cell_size are odd. You are using: batch_size=1, cell_size=375
2019-01-15 00:22:36.239357: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:850] BlockLSTMOp is inefficient when both batch_size and input_size are odd. You are using: batch_size=1, input_size=375
2019-01-15 00:22:36.239422: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:855] BlockLSTMOp is inefficient when both batch_size and cell_size are odd. You are using: batch_size=1, cell_size=375
'üöäzexwvutsrqrrdöwüydgd’bawsmiovyehkxoävuäy’yujgf’rvüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxyxzqzxythruyäkmgfqlävuäy’wdgoröxfwxapfimgönäauwabql’uyäkmgfoüsükdgmuwldkcänävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwwäkkä’njdgmuvzyhrugrsüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrsjroüjvrhtpävuömvyehkxoävuäy’wdgoröxfwwwaautahjtbtwzxytowöxfwxapfimgönävuäy’wdgoröxwwbbabwvmxäpedanöxfwwübjsguzr’tzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnquzxythruyäkmgökioüvrzxythruytmbsvyehkxoävuäy’wdgoröxfwwxr’äwedvtwövyehkxoävuäy’wdgoröxfwxapfimgönädlüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxytjdzsgezdhzmarxytjm’uyäkmgfnyufrsbrwzxythruyäkmfkeluqcj’lävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxapfijtzwxäuaccgckwüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgöocevamäcmmdgaahuöxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwwzbjrsüvrzxythruzdmjdfiümeöwzh’jbhjäcasyzkloxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxytjm’uyäkmgfnyufrsarxythruyäkmgftnarxythruyäkmgfpqtbzluzxythruyülbqxythruyäkmgfsxxofugrncpsplefcmlüedätrzwgnwzxythruyäkmgfnyufrssxythruyäkmgfnyufrsarxythruyäkmgfonyzkloxythruzljlbqxythruyäkmgfqrlävuäy’wdgjbfbör’tzxythruyäkmgfnyufrssxytzcozxythruyäkmgvvrbaaqluquzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxytjdzsgezdhzmarxytjdzsgezdhzmarxyweömffdceacfaawtzr’tzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgöoaoxjnm’izkdfiümeöwzh’jbhjäcasyzkloxythruyäkmdwqütxäpedanöwnfünävuäy’wdgje’üsrzxythruyäkmdwqütxäpedanöwqagjändhityücofskdpjbgj’jlölocaxöxfwxapfimgöoaoxjnm’izkdfiümeöwzh’jbhjäcasyzkloxythruyäkmjbwm’uyäkmgfnyufrssxythruyäkmgfeykdgmuwldkcäoaoxjnm’izkdfiümeöwzh’jbhjäcasyzkloxythruyäkmeu’uyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmfvüyenvuzxythruyäkmgfnyufrssxyxäoqarxythruyäkmgaozsyrsüvrzxytqmdgmuwldkcänädlüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrsarxythruyäkmgfukarxythruyäkmgfshajömbsvyehkxoüuwöegvtmpuzxytäzägbozxythruyäkmgfnyufrssxyxäoqarxythruyäkmflwys’wnävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfww’alndgmuwldkcänävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxjewouzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönöepuöxfwxapfimgönävuäy’wdgoröxfwxapfimgönädlüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfkquöxfwxapfimgönävuäy’wdgoröxccick’laüruzxyzöqzxythruyäkmgfoqütxäpedanöwqagjändhityücofsiyfnzpuzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfozs’uyäkmgfnyufrqöxfwxapfimgöopfcxünil’uyäkmgfnyufrssxythruyäkmdmücofsgkuzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfihänarxythruyäkmfmwys’wnävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwwäcn’püvrzxythruyäkmgfnyufrssxytmjdäoävuäy’xhdfiümeöwzh’jbhjäcasyzkloxytrqzxythruyäkmgfnmzxythruyäkmgftvyzkloxythruzljlbqxythruyäkmgfihänarxythruyäkmdmciämävuäy’xöcvpbqxythruyäkmelxvarxythruyäkmersüvrzxythruyäkmfuxdnxäpedanöxfphdüouzxythruyüysyxäsubozxythruyäkmgfnyufrssxytjdzsgezdhzmarxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmeu’uyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmhj’imvyehkxoävuäy’wdgoröxfwwöudcbzxäpedanöxfwxapfimgönävuäy’wdgoröxfwxd’kömfpyxäoqarxythruyäkmgfnyufrssxythruyäkmgfozs’uyäkmgfnyufrqöxfwxapfimgöoxcapuzxythruyäkmerarxythruyäkmgfpuquzxythruyäkmgaozsyrsüvrzxytirtsfdiyouzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnzwväuhbpyzkloxythruyäkmgfnyufrssxythruyäkmgfeykdgmuwldkcänüuw
öegvtmpuzxytütytuarxythruzdbkbnyzkloxytnnjlrnvöxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwwöudcbzxäpedanöxfwxapfimgönävuäy’wdgoröxfwwövcgjbfbör’tzxytjm’uyäkmgfnyufrsarxythruyäkmgfrjeehftcj’lävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwwuömjcqwzxythruzljlbqxythruyäkmgfozs’uyäkmgfnyufrsbrwzxythruyäkmfkeluqcj’lävuäy’wdgoröxfwwvujkahiäoarxythruyäkmgfnyufrssxytivmävuäy’wdgmxzdwhbpyzkloxythruyäkmgfnyufrsarxythruyäkmgfnhdgmuwldkcänävuäy’wdgoröxfwwöqqleuiglxvarxythruyäkmgfnyufrssxytnnenrpzkmintgbtv’öfh’ämbsvyehkxoädlüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxytmjdäoävuäy’xctxythruyäkmgfnyufrssxythruyäkmebxtyävuäy’wdgogzmarxythruyäkmgaozsyrsüvrzxytryöxfwxapfimgönävuäy’wdgoröxeehaatjgypuzxythruzljlbqxythruyäkmgfnquzxythruyäkmfkxäpedanöxfwxapfimgönävuäy’yujgf’rvüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönöepuöxfwxapfimgönävuäy’wdgoröxfwxapfimgönädlüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönöepuöxfwxapfimgönävuäy’wdgoröxfwxapfimgönäauwabql’uyäkmgfozs’uyäkmgfnyufrssxythruyäkmgfsiehzmgbgacyrqyucuvgkjdpwzxythruyäkmgfnyufrssxyxäoqarxythruyäkmgfnyufrssxythruyäkmgfnyufrssxyzygvuvjiehzmgbgacyrqyuesu’äfoyxäoqarxythruyäkme’könuzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgöoaoxjnm’izkdfiümeöwzh’jbhjäcasyzkloxythruyäkmhj’imvyehkxoävuäy’wdgoröxfwwöe’dedäjnefqlefpwhotintgbtv’öfisz’uyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmhj’imvyehkxoävuäy’wdgoröxfwxahwefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqlefqbj
Inference took 40.706s for 153.414s audio file.

What could be the cause of this behaviour? Shouldn’t the training data be recognized rather easily?

Thank you all in advance!


(Lissyx) #2

We need more information regarding your training setup.


#3

Sure.

I trained on a cluster with 3× GTX 1080 GPUs and the following flags:

  --train_files $path/train.csv \
  --dev_files $path/dev.csv \
  --test_files $path/test.csv \
  --train_batch_size 48 \
  --dev_batch_size 48 \
  --test_batch_size 48 \
  --n_hidden 375 \
  --epoch 50 \
  --display_step 0 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 100 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 10 \
  --use_seq_length False \
  --export_dir $exp_path/model_export_clean/ \
  --decoder_library_path native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path $path/../data/alphabet.txt \
  --lm_binary_path $path/../lm/lm.binary \
  --lm_trie_path $path/../lm/trie \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"

Inference is run on CPU only and on another machine though. Could this be a problem?

I am especially a bit confused since inference on the test set after training worked just fine:

[....]
I Training epoch 44...
I Training of Epoch 44 - loss: inf
100% (524 of 524) |######################| Elapsed Time: 0:04:42 Time:  0:04:42
I Validating epoch 44...
I Validation of Epoch 44 - loss: 72.253890
100% (110 of 110) |######################| Elapsed Time: 0:01:03 Time:  0:01:03
I Training epoch 45...
I Training of Epoch 45 - loss: inf
100% (524 of 524) |######################| Elapsed Time: 0:04:42 Time:  0:04:42
I Validating epoch 45...
I Validation of Epoch 45 - loss: 73.942295
100% (110 of 110) |######################| Elapsed Time: 0:01:04 Time:  0:01:04
I Training epoch 46...
I Training of Epoch 46 - loss: inf
100% (524 of 524) |######################| Elapsed Time: 0:04:44 Time:  0:04:44
I Validating epoch 46...
I Validation of Epoch 46 - loss: 72.497955
100% (110 of 110) |######################| Elapsed Time: 0:01:04 Time:  0:01:04
I Training epoch 47...
I Training of Epoch 47 - loss: inf
100% (524 of 524) |######################| Elapsed Time: 0:04:42 Time:  0:04:42
I Validating epoch 47...
I Validation of Epoch 47 - loss: 72.016894
100% (110 of 110) |######################| Elapsed Time: 0:01:04 Time:  0:01:04
I Training epoch 48...
I Training of Epoch 48 - loss: inf
100% (524 of 524) |######################| Elapsed Time: 0:04:43 Time:  0:04:43
I Validating epoch 48...
I Validation of Epoch 48 - loss: 72.646307
100% (110 of 110) |######################| Elapsed Time: 0:01:03 Time:  0:01:03
I Training epoch 49...
I Training of Epoch 49 - loss: inf
100% (524 of 524) |######################| Elapsed Time: 0:04:24 Time:  0:04:24
I Validating epoch 49...
I Validation of Epoch 49 - loss: 74.899443
I FINISHED Optimization - training time: 2:58:32
100% (110 of 110) |######################| Elapsed Time: 0:01:00 Time:  0:01:00
Preprocessing ['../deepspeech/out_long//test.csv']
0 bad files found in total.
Preprocessing done
Computing acoustic model predictions...
100% (275 of 275) |######################| Elapsed Time: 0:04:14 Time:  0:04:14
Decoding predictions...
100% (275 of 275) |######################| Elapsed Time: 0:51:24 Time:  0:51:24
Test - WER: 0.291379, CER: 14.979015, loss: inf
--------------------------------------------------------------------------------
WER: 8.000000, CER: 52.000000, loss: 278.854095
 - src: " bezeichnet"
 - res: "die am sechsundzwanzigste juli zwei tausend sechs gesprochen"
--------------------------------------------------------------------------------
WER: 6.000000, CER: 28.000000, loss: 142.382004
 - src: "schlesien"
 - res: "sie auch der optik friesische küche"
--------------------------------------------------------------------------------
WER: 5.000000, CER: 24.000000, loss: 70.389633
 - src: "ok"
 - res: "siehe den artikel von den"
--------------------------------------------------------------------------------
WER: 5.000000, CER: 35.000000, loss: 143.164978
 - src: "im"
 - res: "siehe auch hauptartikel deutsche film"
--------------------------------------------------------------------------------
WER: 5.000000, CER: 41.000000, loss: 159.023788
 - src: "zu"
 - res: "siehe auch hauptartikel deutsche filosofie"
--------------------------------------------------------------------------------
WER: 4.000000, CER: 5.000000, loss: 14.872864
 - src: "kandidatenauswahl"
 - res: "an die daten auswahl"
--------------------------------------------------------------------------------
WER: 4.000000, CER: 10.000000, loss: 36.478619
 - src: " chongjin"
 - res: "und schon an um"
--------------------------------------------------------------------------------
WER: 4.000000, CER: 19.000000, loss: 74.246780
 - src: "der"
 - res: "so heisst es wörtlich"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 2.000000, loss: 7.239750
 - src: "einkaufsstrasse"
 - res: "ein auf strasse"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 2.000000, loss: 7.549800
 - src: "zweikaiserproblem"
 - res: "zwei kaiser problem"
--------------------------------------------------------------------------------
I Exporting the model...
I Models exported at ../deepspeech/out_long//model_export/

Sure, the WER is really bad, but at least something is recognized there :wink:

Best regards,
Jan


(Lissyx) #4

What’s that dataset? How much data does it contain?


(Lissyx) #5

That might need adjustments.


(Lissyx) #6

You are using a pre-ctcdecoder-era codebase. Even though it might not change a lot, it might be better if you could work on current master.
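Roughly like this (only a sketch; check the training README of your checkout for the exact ds_ctcdecoder install step):

  # From your existing DeepSpeech checkout: move to current master and
  # install the ds_ctcdecoder package it expects
  git fetch origin && git checkout master
  pip3 install -r requirements.txt
  pip3 install $(python3 util/taskcluster.py --decoder)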


(Lissyx) #7

Nope. But training with infinite loss shows your network is not learning anything at all, and this can be a combination of multiple factors.
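One factor that is cheap to rule out is the audio itself. Assuming your CSVs use the usual wav_filename,wav_filesize,transcript layout, something like this (needs sox/soxi) flags any training file that is not 16 kHz mono:

  tail -n +2 train.csv | cut -d, -f1 | while read -r wav; do
    rate=$(soxi -r "$wav")     # sample rate in Hz
    chans=$(soxi -c "$wav")    # number of channels
    if [ "$rate" != "16000" ] || [ "$chans" != "1" ]; then
      echo "unexpected format (${rate} Hz, ${chans} ch): $wav"
    fi
  done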


(Jens Meier) #9

Any luck getting it running? It would be great if you could share your hyperparameters if it’s working now.


#10

Well, to be honest I reinstalled everything from scratch to get it working. The problem was probably caused by mixing different versions of DeepSpeech (as lissyx already pointed out). Sorry that I cannot give you a more precise answer.
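For anyone hitting the same issue: what worked for me was a clean checkout pinned to a single release, with the pip client on the inference machine installed at the same version, roughly like this:

  git clone https://github.com/mozilla/DeepSpeech
  cd DeepSpeech && git checkout v0.4.1
  pip3 install -r requirements.txt
  # on the inference machine, install the matching client
  pip3 install deepspeech==0.4.1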