Transcription of WAV files with a sample rate of 44100 Hz

I am new to DeepSpeech and have a problem. I trained a DeepSpeech model on my own data. It works fine with WAV files that have a sample rate of 16000 Hz, but when I give it an audio file with a sample rate of 44100 Hz its accuracy decreases. My requirement is that the model should work well with any sample rate. Kindly help me achieve that.

Thanks in advance

When you feed 44.1 kHz audio files into a model that was trained on 16 kHz audio, the samples are interpreted as if they had been recorded at 16 kHz. The model then “hears” a speaker who talks about 2.76x slower and with a much deeper voice.
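The size of the mismatch follows from the ratio of the two sample rates. Here is a minimal sketch of that arithmetic, assuming the model treats every input as 16 kHz audio; the 200 Hz pitch and the function name `perceived` are only illustrative.

```python
MODEL_RATE_HZ = 16_000  # rate the model was trained on (from this thread)
FILE_RATE_HZ = 44_100   # rate of the file that is actually fed in


def perceived(duration_s, pitch_hz, file_rate=FILE_RATE_HZ, model_rate=MODEL_RATE_HZ):
    """Return (duration, pitch) as seen by a model that assumes model_rate."""
    stretch = file_rate / model_rate
    # The same samples are spread over more time, so pitch drops by the same factor.
    return duration_s * stretch, pitch_hz / stretch


stretch = FILE_RATE_HZ / MODEL_RATE_HZ
duration, pitch = perceived(1.0, 200.0)
print(f"stretch factor: {stretch:.2f}")
print(f"1.0 s at 200 Hz is seen as {duration:.2f} s at {pitch:.1f} Hz")
```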

You should adapt your STT processing chain and add a sample rate conversion (SRC) step before feeding the model.
There are a number of ways to do SRC - have a look at sox, librosa, resampy, …
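As one possible way to wire such a step into a Python pipeline, the sketch below shells out to sox. It assumes the sox binary is installed and on the PATH, and that the model wants 16 kHz, mono, 16-bit input; the file names and the helper names `build_sox_command` and `resample` are hypothetical.

```python
import subprocess


def build_sox_command(src, dst, rate_hz=16_000):
    """Build a sox call that converts src to mono 16-bit audio at rate_hz."""
    # Options placed before the output file name apply to the output file.
    return ["sox", src, "-r", str(rate_hz), "-c", "1", "-b", "16", dst]


def resample(src, dst, rate_hz=16_000):
    """Run sox; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_sox_command(src, dst, rate_hz), check=True)


# Show the command that would be run for a hypothetical pair of files.
print(" ".join(build_sox_command("input_44k.wav", "output_16k.wav")))
```

Calling `resample("input_44k.wav", "output_16k.wav")` would then write the converted file, which is what gets passed to the model.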

I tried downsampling the audio file using sox, ffmpeg and Audacity, but I did not get the desired result when I gave the downsampled WAV file to my model. The model gave completely wrong output.

Downsampling can add artifacts.

“When I gave” is not precise enough for us to help you: through some deepspeech command-line interface? Through the API?
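It may also help to confirm what the converted file actually contains before it reaches the model. The sketch below reads the WAV header with Python's standard `wave` module and lists differences from an assumed input format of 16 kHz, mono, 16-bit PCM; that format is an assumption here, so adjust the constants to whatever your model was trained on.

```python
import wave

# Assumed model input format (an assumption, not taken from the thread).
EXPECTED_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1
EXPECTED_BITS = 16


def describe_wav(path):
    """Return (sample_rate_hz, channels, bits_per_sample) from the WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def format_problems(path):
    """List the ways the file differs from the assumed model input format."""
    rate_hz, channels, bits = describe_wav(path)
    problems = []
    if rate_hz != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate_hz} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    if bits != EXPECTED_BITS:
        problems.append(f"{bits}-bit samples, expected {EXPECTED_BITS}-bit")
    return problems
```

An empty list from `format_problems("output_16k.wav")` (a hypothetical file name) means the header matches; a stereo or 44100 Hz file would show up here even if the conversion tool reported no error.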

Can I train my model with a sample rate of 44100 Hz?
I have to transcribe WAV files that have a sample rate of 44100 Hz. When I downsample them, I get completely wrong transcriptions.

Yes, but you need enough data.

Technically you can, but apart from the higher resource requirements for training and inference there is not much benefit. A sample rate of 44100 Hz gives you a maximum frequency of 22050 Hz (see the Nyquist-Shannon sampling theorem). Human speaking voice frequencies are below 11000 Hz, and the most important range is up to 4000 Hz. That is why a sampling rate of around 22000 Hz is a feasible choice, and even 16000 Hz (maximum frequency 8000 Hz) still covers the most important range.
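The sampling theorem part is simple arithmetic: the highest frequency a recording can represent is half its sample rate. A small sketch for a few common rates, with the helper name `nyquist_hz` chosen only for illustration:

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2


for rate in (16_000, 22_050, 44_100):
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")
```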

I am training my DeepSpeech model on training data with a sample rate of 44100 Hz, but I am getting very low accuracy. I tested this model with a WAV file that has a 44100 Hz sample rate.
But when I train my model on training data with a sample rate of 16 kHz, I get good accuracy. I tested that model with a WAV file that has a 16 kHz sample rate.

How can I train my model with a sample rate of 44100 Hz and get good accuracy?

How can I avoid model overfitting in DeepSpeech?

What is the best TTS (text-to-speech) model to convert 50 lines of text into a single WAV file?

I suggest asking this as a new question, e.g. in the TTS subforum: https://discourse.mozilla.org/c/tts/285