Hello everyone, this is my first post on this forum.
I have trained a DeepSpeech 0.5.1 model on 8 kHz data. It works quite well, or at least the test results are satisfactory. Training was done with the parameter --audio_sample_rate 8000 on the 8 kHz data. (I can supply all the training parameters if that would help.)
The problem is that when I run inference, I get very strange results. For a file that gave me this in the test report:
“halten sich die wartehalle in gebieten auf in denen die heide kürzlich verbrannt wurde”
I get just:
for the inference. This file was up-sampled from 8 kHz to 16 kHz before inference.
Or for a longer file I get just:
“aaaaaaaaaaarrrrrrggggghhhh anwesenheitserkennung aaaaaaaaaaarrrrrrggggghhhh”
What is wrong here? Do I have to train the model only with data up-sampled to 16 kHz as well? But then what is the use of the --audio_sample_rate parameter? I am not sure how to interpret this and would be very thankful for any advice!
Inference was run with this command:
deepspeech --model ~/model_export/output_graph.pb --alphabet ~/model_export/alphabet.txt --lm ~/model_export/lm.binary --trie ~/model_export/trie --audio ~/test_audio/test_file.wav
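In case it helps with diagnosing this, here is a minimal sketch (not part of DeepSpeech) of how the sample rate of the WAV file passed to --audio can be checked before inference. It uses only Python's standard wave module and assumes uncompressed PCM WAV files. The names describe_wav and write_silence are hypothetical helpers made up for this illustration, and the demo writes a silent 8 kHz file rather than real speech.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, sample_width_bytes, duration_s) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        frames = wav.getnframes()
    return rate, channels, width, frames / float(rate)


def write_silence(path, rate, seconds=1.0):
    """Hypothetical helper: write 16-bit mono silence so the sketch runs without real data."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


if __name__ == "__main__":
    # Assumption: a temporary file stands in for ~/test_audio/test_file.wav.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo_8k.wav")
    write_silence(demo_path, 8000)
    rate, channels, width, duration = describe_wav(demo_path)
    print("sample rate: %d Hz, channels: %d, bytes/sample: %d, duration: %.2f s"
          % (rate, channels, width, duration))
```

Running describe_wav on both the original 8 kHz file and the up-sampled 16 kHz copy would at least confirm which rate each file really has when it is handed to the deepspeech command.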