For my diploma thesis I am currently trying to train a model for German audio.
I already created an LM on a large-vocabulary corpus (~8 million German sentences) and used various clean audio sources (the Spoken Wikipedia Corpus, among others) to train a model.
While the training finished without problems, the inference produces complete gibberish as output. I even use the same audio files I used for training, and still it doesn’t predict anything meaningful. When the test set was used after training to calculate the WER, my model recognized at least some words.
I use DeepSpeech 0.4.1 for both training and inference, and the audio files are mono 16 kHz WAV.
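In case it helps anyone reproduce this: a minimal sketch for verifying that the audio files actually match the expected input format (mono, 16 kHz, 16-bit PCM), using only the Python standard library. The function name is my own, not part of DeepSpeech:

```python
import wave

def check_wav(path):
    """Report whether a WAV file matches the expected input format:
    mono, 16 kHz sample rate, 16-bit (2-byte) samples."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        width = w.getsampwidth()  # bytes per sample
    ok = (channels == 1 and rate == 16000 and width == 2)
    return ok, {"channels": channels, "rate": rate, "sample_width": width}
```

Files that fail the check can be converted with a tool like sox before training or inference.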
This is the inference output:
(YouTubeDownloads_python3) ➜ native_client git:(master) ✗ deepspeech --model …/lm-ynop-stock/models/model_export_clean/output_graph.pb --alphabet …/lm-ynop-stock/alphabet.txt --audio …/audio.wav
Loading model from file …/lm-ynop-stock/models/model_export_clean/output_graph.pb
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
Loaded model in 0.0316s.
Running inference.
2019-01-15 00:22:36.216963: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:855] BlockLSTMOp is inefficient when both batch_size and cell_size are odd. You are using: batch_size=1, cell_size=375
[…repeats very often…]
2019-01-15 00:22:36.228672: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:850] BlockLSTMOp is inefficient when both batch_size and input_size are odd. You are using: batch_size=1, input_size=375
'üöäzexwvutsrqrrdöwüydgd’bawsmiovyehkxoävuäy’yujgf’rvüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxyxzqzxythruyäkmgfqlävuäy’wdgoröxfwxapfimgönäauwabql’uyäkmgfoüsükdgmuwldkcänävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwwäkkä’njdgmuvzyhrugrsüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrsjroüjvrhtpävuömvyehkxoävuäy’wdgoröxfwwwaautahjtbtwzxytowöxfwxapfimgönävuäy’wdgoröxwwbbabwvmxäpedanöxfwwübjsguzr’tzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnquzxythruyäkmgökioüvrzxythruytmbsvyehkxoävuäy’wdgoröxfwwxr’äwedvtwövyehkxoävuäy’wdgoröxfwxapfimgönädlüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxytjdzsgezdhzmarxytjm’uyäkmgfnyufrsbrwzxythruyäkmfkeluqcj’lävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwxapfijtzwxäuaccgckwüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgöocevamäcmmdgaahuöxfwxapfimgönävuäy’wdgoröxfwxapfimgönävuäy’wdgoröxfwwzbjrsüvrzxythruzdmjdfiümeöwzh’jbhjäcasyzkloxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxytjm’uyäkmgfnyufrsarxythruyäkmgftnarxythruyäkmgfpqtbzluzxythruyülbqxythruyäkmgfsxxofugrncpsplefcmlüedätrzwgnwzxythruyäkmgfnyufrssxythruyäkmgfnyufrsarxythruyäkmgfonyzkloxythruzljlbqxythruyäkmgfqrlävuäy’wdgjbfbör’tzxythruyäkmgfnyufrssxytzcozxythruyäkmgvvrbaaqluquzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgönüuwöegvtmpuzxytjdzsgezdhzmarxytjdzsgezdhzmarxyweömffdceacfaawtzr’tzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrqöxfwxapfimgöoaoxjnm’izkdfiümeöwzh’jbhjäcasyzkloxythruyäkmdwqütxäpedanöwnfünävuäy’wdgje’üsrzxythruyäkmdwqütxäpedanöwqagjändhityücofskdpjbgj’jlölocaxöxfwxapfimgöoaoxjnm’izkdfiümeöwzh’jbhjäcasyzkloxythruyäkmjbwm’uyäkmgfnyufrssxythruyäkmgfeykdgmuwldkcäoaoxjnm’izkdfiümeöwzh’jbhjäcasyzkloxythruyäkmeu’uyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmfvüyenvuzxythruyäkmgfnyufrssxyxäoqarxythruyäkmgaozsyrsüvrzxytqmdgmuwldkcänädlüvrzxythruyäkmgfnyufrssxythruyäkmgfnyufrssxythruyäkmgfnyufrsarxythruyäkmgfukarxythruyäkmgfshajömbsvyehkxoüuwöegvtmpuzxytäzägbozxythruyäkmg
[…gibberish continues in the same pattern…]
Inference took 40.706s for 153.414s audio file.
What could be the cause of this behaviour? Shouldn’t the training data be recognized fairly easily?
Thank you all in advance!
lissyx
We need more information regarding your training setup.
Well, TBH I reinstalled everything from scratch to get it working. This was probably caused by mixing different versions of DeepSpeech (as lissyx already pointed out). Sorry that I cannot give you a more precise answer.
What do you think is a good value for the n_hidden parameter?
I tried 375, 1024, and 2048 (early stop enabled), but I’m getting very high validation and test losses, though the training losses are much lower.
For example:
With n_hidden = 375: WER = 0.582319, CER = 36.162546, loss = 146.159454
With n_hidden = 1024: WER = 0.759299, CER = 27.491103, loss = 101.068916
The models are not producing anything close to the expected transcription when tested with test WAV files, but they give perfect output with training WAV files. It looks like the model has overfitted even though early stop is enabled. Also, the training loss falls sharply to ~20 while the validation loss stays high at ~100.
Any suggestions on how to improve the test/validation loss?
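For what it’s worth, early stopping is essentially a patience check on the validation loss: stop when it hasn’t improved for several epochs in a row. A minimal generic sketch of that logic (illustrative only, not DeepSpeech’s actual implementation; the parameter names are my own):

```python
def should_stop(val_losses, patience=4, min_delta=0.05):
    """Generic early-stopping check: return True when the validation loss
    has not improved by at least min_delta within the last `patience`
    epochs, compared to the best loss seen before that window.
    Illustrative sketch only -- not DeepSpeech's implementation."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta
```

With a training loss near ~20 and a validation loss stuck near ~100, this kind of check fires early, which is consistent with overfitting rather than a bug in the stopping logic itself.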
Well, now I have also included the Common Voice German dataset, so I have ~200 hours of training data in total.
Then I trained with n_hidden = 2048. The alphabet.txt file I used for training is the following:
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ß
à
á
ã
ä
é
í
ó
ö
ú
ü
ă
ş
ș
'
–
’
“
#
It has all the German letters and other necessary symbols.
The issue I am facing at inference is that, whatever the input, the output always consists of plain English letters, even though the utterances contain the German special characters (ä, ö, ü, ß).
Any insight on this behavior?
Is it because my alphabet.txt has more than just the basic German diacritics, as can be seen above?
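One quick sanity check is whether every character in your training transcripts actually appears in alphabet.txt; a mismatch there can cause characters to be dropped or mislearned. A minimal sketch, assuming the transcripts are available as plain strings and that the space character is handled separately (the function name is my own):

```python
def missing_chars(transcripts, alphabet_path):
    """Return the set of characters used in the transcripts that are
    absent from alphabet.txt. Assumes one character per line in the
    alphabet file and that the space separator is handled elsewhere."""
    with open(alphabet_path, encoding="utf-8") as f:
        alphabet = {line.rstrip("\n") for line in f}
    alphabet.add(" ")  # assumption: space is not listed explicitly
    used = set("".join(transcripts))
    return used - alphabet
```

If this returns a non-empty set for your German transcripts, the special characters may never have been trainable targets in the first place.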
Please note I am talking about LANGUAGE model training, not acoustic model training.
I am asking because I saw similar behaviour when I trained the acoustic model with German data but used a prebuilt English language model.