Inference with a model trained at a sample rate other than 16 kHz

Hello everyone, this is my first post on this forum.

I have trained a DeepSpeech 0.5.1 model on 8 kHz data. It works quite well, or at least the test results are satisfactory. Training used the parameter --audio_sample_rate 8000 together with the 8 kHz data. (I can supply all the training parameters if that would help.)

The problem is that inference gives very strange results. For a file that the test report transcribed as:

“halten sich die wartehalle in gebieten auf in denen die heide kürzlich verbrannt wurde”

I get just:

“er”

for the inference. This file was up-sampled from 8kHz to 16 kHz before inference.

Or for a longer file I get just:
“aaaaaaaaaaarrrrrrggggghhhh anwesenheitserkennung aaaaaaaaaaarrrrrrggggghhhh”

What is wrong here? Do I also have to train the model only on up-sampled 16 kHz data? But then what is the use of the --audio_sample_rate parameter? I am not sure how to interpret this and would be very thankful for any advice!

Inference was run with this command:
deepspeech --model ~/model_export/output_graph.pb --alphabet ~/model_export/alphabet.txt --lm ~/model_export/lm.binary --trie ~/model_export/trie --audio ~/test_audio/test_file.wav

Thanks!

Here are the training parameters:

python3 DeepSpeech.py \
  --audio_sample_rate 8000 \
  --lm_trie_path new_lm/trie \
  --lm_binary_path new_lm/lm.binary \
  --checkpoint_dir new_lm/checkpoints \
  --export_dir new_lm/model_export \
  --alphabet_config_path new_lm/alphabet.txt \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --test_files data/test.csv \
  --es_steps 5 \
  --train_batch_size 24 \
  --dev_batch_size 48 \
  --test_batch_size 48 \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.18 \
  --display_step 0 \
  --epochs 50 \
  --decoder_library_path native_client/libctc_decoder_with_kenlm.so \
  --n_steps 16 \
  --summary_secs 600 \
  --dropout_rate2 -1 \
  --dropout_rate3 -1 \
  --dropout_rate4 0 \
  --dropout_rate5 0 \
  --dropout_rate6 -1 \
  --relu_clip 20 \
  --early_stop True \
  --es_mean_th 0.5 \
  --es_std_th 0.5 \
  --beam_width 1024 \
  --lm_alpha 0.75 \
  --lm_beta 1.85 \
  --beta1 0.9 \
  --beta2 0.999 \
  --epsilon 1e-08 \
  --valid_word_count_weight 2.25 \
  --limit_train 0 \
  --limit_dev 0 \
  --limit_test 0 \
  --export_batch_size 1 \
  --use_seq_length True \
  --log_level 1 \
  --max_to_keep 5

How many hours of content are you training on?

Around 500k hours. Maybe I will be able to get more data later, for now I just wanted to see that it is working in my case :slight_smile:

Sorry, I might be reading your post wrong. Why do you upsample to 16 kHz instead of using the 8 kHz audio directly?

I trained directly on 8 kHz data, but because inference printed the warning “Warning: original sample rate (8000) is different than 16kHz. Resampling might produce erratic speech recognition.”, I ran inference on data up-sampled to 16 kHz.

The poor results for this 16 kHz inference are in the first post. When I tried inference with the original 8 kHz files (just to check), the results were similar.

The warning tells you that it makes no difference whether you upsample yourself or not: if you don’t, the client resamples right after printing the warning.
In either case, the resampling might produce erratic speech recognition.
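To see which case a given file falls into before running the client, the WAV header can be read directly. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_sample_rate` and its `expected_rate` parameter are assumptions made for illustration and are not part of the DeepSpeech client.

```python
import wave


def check_sample_rate(wav_path, expected_rate=16000):
    """Return (actual_rate, matches) for a WAV file.

    expected_rate is a hypothetical parameter: the rate the inference
    code assumes (16000 by default, as in the warning text above).
    """
    with wave.open(wav_path, "rb") as wav:
        actual_rate = wav.getframerate()
    return actual_rate, actual_rate == expected_rate
```

Calling `check_sample_rate("test_file.wav")` on an 8 kHz recording would report a mismatch, which is the situation in which the warning is printed.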

Did you also upsample the training data?
Otherwise have a look at e.g. import_cv.py, and make sure that the sample rate is 16 kHz with transformer.convert(samplerate=SAMPLE_RATE) during training.
It is important to have the same preprocessing for training and for testing.
(e.g. I faced a similar problem when training on “mp3 > wav” preprocessed data when testing on “direct wav” data. Adding a “wav > mp3 > wav” preprocessing improved accuracy.)
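One way to keep the preprocessing identical is to run every file, training and inference alike, through the same conversion script. Below is a rough sketch of doubling an 8 kHz, 16-bit mono WAV to 16 kHz with plain linear interpolation, using only the standard library. The function names are hypothetical, and linear interpolation is a deliberately crude stand-in: a dedicated resampler such as SoX applies proper filtering and would normally be the better choice.

```python
import array
import sys
import wave


def upsample_2x(samples):
    """Double the sample count by linear interpolation between neighbours."""
    out = array.array("h")
    for i, current in enumerate(samples):
        # The last sample has no right neighbour, so it is repeated.
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + nxt) // 2)
    return out


def upsample_wav_8k_to_16k(src_path, dst_path):
    """Write a 16 kHz copy of an 8 kHz, 16-bit mono WAV file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        if src.getframerate() != 8000:
            raise ValueError("expected an 8 kHz file")
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    doubled = upsample_2x(samples)
    if sys.byteorder == "big":
        doubled.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(16000)
        dst.writeframes(doubled.tobytes())
```

The point of the sketch is less the resampling method than the workflow: whatever conversion is applied to the training files should also be applied to the files used for inference.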

I trained on 8 kHz data directly. So if I understand correctly, there are two ways to get inference working correctly with 8 kHz data:

  1. Train on data upsampled from 8 kHz to 16 kHz. I can make sure this is done by setting the training parameter --audio_sample_rate 16000. Then during inference I again upsample all data to 16 kHz. This ensures that the preprocessing of training and inference data is the same.
  2. Train directly on 8 kHz data (--audio_sample_rate 8000), which is what I have done now. Then, for inference, disable the upsampling in client.py to keep the original sample rate, and inference will work correctly (because the preprocessing of training data and inference data will be the same again, as in method 1).

Am I right? :slight_smile:
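Whichever of the two options is used, every file listed in the training, dev and test CSVs has to share one sample rate. A minimal sketch for auditing that, assuming each CSV has a `wav_filename` column holding paths to WAV files; the function name `count_sample_rates` is hypothetical:

```python
import csv
import wave
from collections import Counter


def count_sample_rates(csv_path, column="wav_filename"):
    """Count how many files in a dataset CSV use each sample rate."""
    rates = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            with wave.open(row[column], "rb") as wav:
                rates[wav.getframerate()] += 1
    return rates
```

A result with more than one key, for example both 8000 and 16000, would indicate that the dataset mixes sample rates and needs to be converted before training.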

@dwn is right. The fact that it does show this warning also means the code you are using is made for 16kHz, and so it upsamples. But then the model expects 8kHz since you trained on that. So it’s not surprising you run into issues.

Running with anything other than 16 kHz is not really supported yet. @Jendker, are you using the Python bindings?

For training I am using the python bindings:

DeepSpeech.py <training parameters>

And the inference is done with the binary installed with pip3:

pip3 install deepspeech
deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio my_audio_file.wav

But doesn’t the deepspeech binary use the Python code in client.py, which I could then theoretically edit to disable the upsampling to 16 kHz?

EDIT: Or, to put it a different way: is there any way to run inference with my model trained directly on 8 kHz data, using the parameter --audio_sample_rate 8000? The model trained correctly, and the test results after training were good.

That’s exactly what I was saying: the current code assumes the model is trained for 16kHz. Please read the code and adapt: native_client/python/client.py

Thank you! So it may be adapted. I will try this out and come back with my findings.

Upsampling the data to 16 kHz before training let me avoid the inference problem, and the test results did not get any worse :slight_smile: Problem solved.