Gibberish inference results when using the pre-trained model on audio from an unknown distribution

I was trying to run inference with the pre-trained LibriSpeech model on some audio samples collected at random from the web. The result is quite depressing: the model predicted every single character wrong.

Ground truth: “The Story of Arthur the Rat. Once upon a time there was a rat who couldn’t make up his”

Predicted : “HAM AUUEWIR CCHIUVHE C O HO AA UBBUSH”

Is there any way to solve this?

Random audio samples? Can you share more information about their characteristics?

I downloaded the audio sample from here, then split it into 7 s segments.
Audio properties:
Duration: 7 s
Channels: 2
Sampling rate: 44.1 kHz
Bit rate: 112 kbps

If you need more info please feel free to ask.

Okay, then I’d guess our automatic resampling is not good enough and likely destroys the data. The model expects mono, 16 kHz, 16-bit PCM audio. We do have code that performs that conversion, but evidently it is not good enough here.
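
To see whether a given file already matches that format, something like the following could be used. It is only a minimal sketch: it assumes the file is a WAV container (Python's standard `wave` module will not read compressed formats), and `wav_format_problems` is a hypothetical helper name, not part of the project's code.

```python
import wave

# Target format stated above: mono, 16 kHz, 16-bit PCM.
EXPECTED_CHANNELS = 1
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def wav_format_problems(path):
    """Return a list of mismatches against the expected format (empty if none).

    Hypothetical helper for illustration; it only reads the WAV header.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"channels: got {channels}, expected {EXPECTED_CHANNELS}")
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sampling rate: got {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width: got {width * 8} bits, expected 16 bits")
    return problems
```

For a stereo 44.1 kHz WAV file like the one described above, this would report a channel mismatch and a sampling rate mismatch.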

Would you have a direct link to one of the samples? I’d like to see what happens after the transformation.

Sure, audio sample arthur the rat

Use sox or ffmpeg to convert the input file to the encoding the model expects (16-bit, mono, 16 kHz). With sox, the command string can be built like this:
"sox " + ip_file + " --bits 16 --channels 1 --rate 16000 " + op_file;
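
The same sox call can be made from Python. This is only a sketch: it assumes `sox` is installed and on the PATH, and `build_sox_command` and `convert_for_model` are hypothetical helper names. Passing the arguments as a list instead of one concatenated string avoids quoting problems when file names contain spaces.

```python
import subprocess


def build_sox_command(ip_file, op_file):
    """Build the sox argument list from the command above.

    The format options are placed before the output file, so they
    apply to the output: 16-bit samples, 1 channel, 16 kHz.
    """
    return [
        "sox", ip_file,
        "--bits", "16",
        "--channels", "1",
        "--rate", "16000",
        op_file,
    ]


def convert_for_model(ip_file, op_file):
    """Run sox; raises CalledProcessError if sox exits with an error."""
    subprocess.run(build_sox_command(ip_file, op_file), check=True)
```

Calling `convert_for_model("input.wav", "output.wav")` (placeholder file names) would then write a converted copy and leave the original untouched.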