I’m currently working with movie clip data and would like to convert the audio dialogue to text. All clips are in English, so I’m hoping I can just use the pre-trained DeepSpeech model.
However, I’m currently getting “gibberish” when I run my audio clips through the model. My concern is that the downsampling I’ve used causes information loss/corruption (even though the clip sounds fine when played).
Can someone suggest the correct way to downsample, please? I’m guessing someone must have solved this already, but I couldn’t find anything conclusive on the forums.
This is my workflow:
The pre-trained model is 0.5.0.
Extract the audio from the video file using ffmpeg: ffmpeg -i original.avi -ab 160k -ac 1 -ar 16000 -vn audio.wav. The clips are at 44.1 kHz before extraction and 16 kHz after.
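Before blaming the resampler, it may be worth verifying that the extracted file really is 16 kHz / mono / 16-bit, since DeepSpeech expects exactly that. A minimal check using only Python’s standard library (the audio.wav name just mirrors the ffmpeg command above; the synthetic file here stands in for a real extracted clip):

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, sample_width_bytes) for a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth()

# Write one second of 16 kHz mono 16-bit silence as a stand-in for the
# ffmpeg-extracted audio.wav, then inspect its header.
with wave.open("audio.wav", "wb") as w:
    w.setnchannels(1)      # mono, as produced by -ac 1
    w.setsampwidth(2)      # 16-bit PCM, which DeepSpeech expects
    w.setframerate(16000)  # 16 kHz, as produced by -ar 16000
    w.writeframes(b"\x00\x00" * 16000)

print(wav_info("audio.wav"))  # (16000, 1, 2)
```

If the tuple isn’t (16000, 1, 2) for your real files, the problem is upstream of the model.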
Just out of curiosity, do you get better results with 0.4.1? Most of my files are down-converted, and 0.5 produces gibberish on some where 0.4.1 produced bad results, but results that could at least be recognized as a sentence with some of the words correct.
@dabinat How do you down-sample? I know some (standard) methods of doing so are incorrect, e.g. using Python’s audioop. (See for example issue 1726 where we fixed this in our importers.)
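To illustrate why the method matters: one failure mode of naive down-sampling is changing the sample count without a true rate conversion, which shifts pitch and timing even when the clip still sounds vaguely plausible. A toy sketch in pure Python (440 Hz test tone; the linear-interpolation resampler is illustrative only, not production quality):

```python
import math

def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation. No anti-alias filter, so this is
    only adequate for a low-frequency test tone, not real speech pipelines."""
    ratio = src_rate / dst_rate
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a + (b - a) * frac)
    return out

def dominant_freq_hz(samples, rate):
    """Crude frequency estimate from positive-going zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate / len(samples)

src_rate, dst_rate, tone = 44100, 16000, 440.0
sine = [math.sin(2 * math.pi * tone * n / src_rate) for n in range(src_rate)]

good = resample_linear(sine, src_rate, dst_rate)
naive = sine[::2]  # keeping every other sample yields 22.05 kHz, not 16 kHz

print(round(dominant_freq_hz(good, dst_rate)))   # stays close to 440 Hz
print(round(dominant_freq_hz(naive, dst_rate)))  # drops well below 440 Hz
```

The naively decimated signal, if labelled 16 kHz, plays back pitch-shifted — exactly the kind of subtle corruption a model notices even when a human listener shrugs it off.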
I have no idea what the default setting is in ffmpeg, but dithering can be specified in the aresample filter. I played about with some of the options listed here, including dither_method, resampler, precision, filter_type and output_sample_bits, but the transcription either stayed the same or got worse.
Is this relevant for the latest 0.7.0 release as well? I see that client.convert_samplerate does the downsampling using the sox library. With 44.1 kHz input (and DeepSpeech 0.7.0) I still see inference in pretty bad shape. Is downsampling not recommended at all? I understand upsampling from 8 kHz will produce erratic transcription results, but shouldn’t downsampling give results comparable to native 16 kHz input?
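For anyone wanting to reproduce the client’s approach outside of it: convert_samplerate shells out to sox. A rough sketch of that idea, building (not executing) a sox command line that targets 16 kHz mono 16-bit signed little-endian raw PCM; these are standard sox flags, but they may not match the client’s exact invocation:

```python
import shlex

def build_sox_cmd(audio_path, desired_rate=16000):
    """Build a sox command line converting a WAV file to the raw PCM
    format DeepSpeech expects (sketch, not the client's exact flags)."""
    return (
        f"sox {shlex.quote(audio_path)} --type raw --bits 16 --channels 1 "
        f"--rate {desired_rate} --encoding signed-integer --endian little -"
    )

print(build_sox_cmd("input_44100.wav"))
```

Running the resulting command (via subprocess, with sox installed) writes the converted raw audio to stdout, which is how the client feeds it to the model.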
I am currently having this issue with 0.7.0. I have been importing 44.1 kHz files (usually .mkv or .avi) and then down-sampling them to 16 kHz in Audacity (which I think uses ffmpeg) before exporting to .wav. The results are middling to poor without down-sampling, but with down-sampling they’re nonsense.
I don’t know how much of this is related to the importing and exporting itself, as the quality suffers more with down-sampling.