Recommended approach for downsampling 44.1kHz audio to 16kHz to ensure accurate results?

Hi there,

I’m currently working with movie clip data, and I would like to convert the audio dialogue to text. All clips are in English, so I’m hoping I can just use the pre-trained DeepSpeech model.

However, I’m currently getting “gibberish” when I run my audio clips through the model. My concern is that the downsampling I’ve used causes information loss/corruption (even though the clip sounds fine when played).

Can someone suggest the correct way to downsample, please? I’m guessing someone must have solved this already, but I couldn’t find anything conclusive on the forums.

This is my workflow:

  • pretrained model is 0.5.0
  • extract the audio from the video file using ffmpeg: ffmpeg -i original.avi -ab 160k -ac 1 -ar 16000 -vn audio.wav. The clips are at 44.1kHz before extraction and 16kHz after
  • run inference on the file using:
deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio sox_out.wav
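Before running inference, it can help to confirm that the file actually matches what the pre-trained model expects (16 kHz, mono, 16-bit PCM). A minimal sketch using Python’s standard wave module — check_wav is a hypothetical helper for illustration, not part of DeepSpeech:

```python
import wave

def check_wav(path):
    """Return True if the file is 16 kHz, mono, 16-bit PCM
    (the input format the 0.5.0 pre-trained DeepSpeech model expects)."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)
```

Calling check_wav("audio.wav") on the downsampled clip before inference would at least rule out a container/format mismatch as the cause of the gibberish.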

This is the soxi of the audio file before downsampling

Input File     : 'audio.wav'
Channels       : 2
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:01:48.51 = 4785408 samples = 8138.45 CDDA sectors
File Size      : 19.1M
Bit Rate       : 1.41M
Sample Encoding: 16-bit Signed Integer PCM

and this is soxi after downsampling

Input File     : 'audio_down.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:01:48.51 = 1736202 samples ~ 8138.45 CDDA sectors
File Size      : 3.47M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
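As a quick sanity check on the two soxi readouts above: the sample counts divided by the sample rates should give the same duration, and here they do (both about 108.51 s), so the resample at least preserved the clip length:

```python
# sample counts and rates copied from the soxi output above
before = 4785408 / 44100   # original: 44.1 kHz
after = 1736202 / 16000    # downsampled: 16 kHz
# both durations should be ~108.51 s (00:01:48.51)
assert abs(before - after) < 0.001
```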

Thanks for your help!

Just out of curiosity, do you get better results with 0.4.1? Most of my files are down-converted, and 0.5 produces gibberish on some files where 0.4.1 produced poor results, but at least results that could be recognized as a sentence with some of the words correct.

@dabinat How do you down-sample? I know some (standard) methods of doing so are incorrect, e.g. using Python’s audioop. (See for example issue 1726 where we fixed this in our importers.)
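For reference, one way to do a properly filtered downsample in Python is polyphase resampling, which applies an anti-aliasing low-pass filter internally — unlike naive sample-dropping. This is a sketch assuming SciPy is available; it is not the actual fix from issue 1726:

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_44k_to_16k(samples):
    """Polyphase resample 44100 Hz -> 16000 Hz (ratio 160/441).
    resample_poly low-pass filters before decimating, avoiding aliasing."""
    return resample_poly(np.asarray(samples, dtype=np.float64),
                         up=160, down=441)
```

For a one-second clip (44100 samples in), this yields exactly 16000 samples out.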

@kdavis I use the following ffmpeg command:

ffmpeg -i [source file] -vn -ar 16000 -ac 1 [destination file]


How does the spectrum look in, say, Audacity?

Here’s what it looks like in Adobe Audition.

Doesn’t seem to be anything strange in the spectrum.

Do you know if dithering is on or off, and if on, in what mode?

I have no idea what the default setting is in ffmpeg, but dithering can be specified in the aresample filter. I played about with some of the options listed there, including dither_method, resampler, precision, filter_type and output_sample_bits, but the transcription either stayed the same or got worse.

My ffmpeg command now looks like this:

ffmpeg -i [input file] -vn -ar 16000 -ac 1 -filter "aresample=isr=44100:osr=16000:dither_method=triangular_hp:resampler=swr:filter_type=cubic" [output file]

Do you have any suggestions for which resampler options are likely to help?

By the way, the source format I’m converting from is AAC if that makes a difference.

Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 125 kb/s (default)

Thanks for the responses @dabinat and @kdavis. I’ll have a play with different versions of the model and let you know how it goes.

@dabinat and @kdavis, I tried version 0.4.1 of the model and the code, but unfortunately there was no improvement over 0.5.0.

Likewise, I tried the suggested downsampling above.

Perhaps this is not solvable for me without retraining the model?