Hello everyone, this is my first post on this forum.
I have trained a DeepSpeech 0.5.1 model on 8 kHz data and it works quite well, or at least the test results are satisfactory. The training was done with the parameter --audio_sample_rate 8000 and the 8 kHz data. (I will supply all the training parameters if advised.)
The problem is that when I run inference I get very strange results. For a file which in the test report gave me:
“halten sich die wartehalle in gebieten auf in denen die heide kürzlich verbrannt wurde”
I get just:
“er”
for the inference. This file was up-sampled from 8kHz to 16 kHz before inference.
Or for a longer file I get just:
“aaaaaaaaaaarrrrrrggggghhhh anwesenheitserkennung aaaaaaaaaaarrrrrrggggghhhh”
What is wrong here? Do I have to train the model only on up-sampled 16 kHz data as well? But what then is the use of the --audio_sample_rate parameter? I am not sure how to interpret this and will be very thankful for any advice!
Inference was done with this command:
deepspeech --model ~/model_export/output_graph.pb --alphabet ~/model_export/alphabet.txt --lm ~/model_export/lm.binary --trie ~/model_export/trie --audio ~/test_audio/test_file.wav
I trained directly on 8 kHz data, but because during inference I get the warning “Warning: original sample rate (8000) is different than 16kHz. Resampling might produce erratic speech recognition.”, I ran inference on data up-sampled to 16 kHz.
The poor results of such inference with 16 kHz data are in the first post. When I tried inference with 8 kHz data (just to check), the results were similar.
The warning tells you that it actually makes no difference whether you resample yourself or not; if you don’t, the client will do it right after printing the warning.
In any case, the resampling might produce erratic speech recognition.
Did you also upsample the training data?
Otherwise have a look at e.g. import_cv.py, then make sure that the sample rate is 16 kHz during training with transformer.convert(samplerate=SAMPLE_RATE).
It is important to have the same preprocessing for training and test.
(e.g. I faced a similar problem when training on “mp3 > wav” preprocessed data while testing on “direct wav” data. Adding a “wav > mp3 > wav” preprocessing step improved accuracy.)
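The up-sampling step discussed above (import_cv.py does it with pysox’s transformer.convert(samplerate=SAMPLE_RATE)) can be sketched in plain numpy. This is only an illustrative stand-in, not the actual DeepSpeech preprocessing: the function name resample_linear is my own, and a real pipeline would low-pass filter before resampling rather than use bare linear interpolation.

```python
import numpy as np

def resample_linear(audio: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample a mono signal by linear interpolation.

    Sketch only: production tools such as sox apply a proper anti-aliasing
    filter; this just shows how the sample count changes with the rate.
    """
    duration = len(audio) / src_rate
    n_dst = int(round(duration * dst_rate))
    src_t = np.arange(len(audio)) / src_rate   # timestamps of source samples
    dst_t = np.arange(n_dst) / dst_rate        # timestamps of target samples
    return np.interp(dst_t, src_t, audio).astype(audio.dtype)

# Example: up-sample one second of 8 kHz audio to 16 kHz.
tone_8k = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
tone_16k = resample_linear(tone_8k, 8000, 16000)
```

The point of the thread is exactly this: whatever transform produces the training audio (8 kHz kept as-is, or 8 kHz up-sampled to 16 kHz) must also be applied to the audio fed to inference.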
I trained on 8 kHz data directly. So if I understand correctly there are two ways of getting inference to work right with 8 kHz data:
1. Train on data upsampled from 8 kHz → 16 kHz. I can make sure this is done by setting the training parameter --audio_sample_rate 16000. Then during inference I again upsample all data to 16 kHz. This ensures that the preprocessing of training and inference data is the same.
2. Train directly on 8 kHz data (--audio_sample_rate 8000), which is what I have done now. Then for inference, disable the upsampling in client.py to keep the original sample rate, and inference will work right (because, as in method 1, the preprocessing of training data and inference data will again be the same).
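The check behind the second option can be sketched with the stdlib wave module: read the sample rate from the WAV header and only resample when it differs from the rate the model was trained on. The helper names and the demo file below are my own illustration, not the actual client.py code.

```python
import struct
import wave

# Write a tiny 8 kHz mono WAV so the check below has something to inspect.
with wave.open("demo_8k.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)           # 16-bit samples
    w.setframerate(8000)
    w.writeframes(struct.pack("<8h", *([0] * 8)))

def wav_sample_rate(path: str) -> int:
    """Return the sample rate declared in a WAV file's header."""
    with wave.open(path, "rb") as w:
        return w.getframerate()

def needs_resampling(path: str, model_rate: int) -> bool:
    """True if the file's rate differs from the model's training rate.

    In a client adapted for an 8 kHz model, audio that is already 8 kHz
    would be passed to the model untouched instead of being upsampled.
    """
    return wav_sample_rate(path) != model_rate
```

With an 8 kHz model, `needs_resampling("demo_8k.wav", 8000)` is False, so the audio would be fed through unchanged; the stock 16 kHz client would instead upsample it.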
Am I right?
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
@dwn is right. The fact that it shows this warning also means the code you are using is made for 16 kHz, and so it upsamples. But then the model expects 8 kHz, since you trained on that. So it’s not surprising you run into issues.
Running with anything other than 16 kHz is not really supported yet. @Jendker, are you using the Python bindings?
But isn’t the deepspeech binary using the Python code client.py, which I could then theoretically edit to disable the upsampling to 16 kHz?
EDIT: Or to put it differently: is there any way to run inference using my model trained directly on 8 kHz data with the parameter --audio_sample_rate 8000? The model was trained correctly; testing results after training were good.
lissyx
That’s exactly what I was saying: the current code assumes the model is trained for 16kHz. Please read the code and adapt: native_client/python/client.py
Upsampling the data to 16 kHz before training allowed me to avoid the inference problem, and the test results did not get any worse. Problem solved!
Maybe instead of creating a new topic for the same case I will ask in this old one:
has anyone managed to run inference with good results after training an 8 kHz model?
I tried with the new 0.6.0 version, and even though client.py now automatically detects the sample rate of the model, the inference results differ significantly from what I get during the test phase (they are much worse).
Of course it was nowhere stated that DeepSpeech is now fully compatible with data at sample rates other than 16 kHz. I just wanted to ask whether anyone has managed to succeed with such a model (sample rate != 16000).
I have checked it on my end and unfortunately it does not work as expected. Did you change some parameters of the model geometry, such as n_input (the number of MFCC features), during training? Maybe that leads to problems…
Interestingly, after I trained the model with the default number of features, the inference results now closely match the test results. I am not sure why this happens, but thank you very much, reuben, for the information; it encouraged me to run a couple more trainings for testing.