I’m currently working with movie clip data and would like to convert the audio dialogue to text. All clips are in English, so I’m hoping I can just use the pre-trained DeepSpeech model.
However, I’m currently getting “gibberish” when I run my audio clips through the model. My concern is that the downsampling I’ve used causes information loss/corruption (even though the clip sounds fine when played).
Can someone suggest the correct way to downsample, please? I’m guessing someone must have solved this already, but I couldn’t find anything conclusive on the forums.
This is my workflow:
The pre-trained model is 0.5.0.
Extract the audio from the video file using ffmpeg: ffmpeg -i original.avi -ab 160k -ac 1 -ar 16000 -vn audio.wav. The clips are at 44.1 kHz before extraction and 16 kHz after.
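Before blaming the resampler, it may be worth verifying that the extracted file really is 16 kHz / mono / 16-bit, since DeepSpeech expects exactly that. A minimal check using only Python’s standard library (the audio.wav name just mirrors the ffmpeg command above; the synthetic file here stands in for a real extracted clip):

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, sample_width_bytes) for a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth()

# Write one second of 16 kHz mono 16-bit silence as a stand-in for the
# ffmpeg-extracted audio.wav, then inspect its header.
with wave.open("audio.wav", "wb") as w:
    w.setnchannels(1)      # mono, as produced by -ac 1
    w.setsampwidth(2)      # 16-bit PCM, which DeepSpeech expects
    w.setframerate(16000)  # 16 kHz, as produced by -ar 16000
    w.writeframes(b"\x00\x00" * 16000)

print(wav_info("audio.wav"))  # (16000, 1, 2)
```

If the tuple isn’t (16000, 1, 2) for your real files, the problem is upstream of the model.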
Just out of curiosity, do you get better results with 0.4.1? Most of my files are down-converted, and 0.5 produces gibberish on some where 0.4.1 produced bad results, but results that could at least be recognized as a sentence with some of the words correct.
@dabinat How do you down-sample? I know some (standard) methods of doing so are incorrect, e.g. using Python’s audioop. (See for example issue 1726 where we fixed this in our importers.)
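To illustrate why the method matters: one failure mode of naive down-sampling is changing the sample count without a true rate conversion, which shifts pitch and timing even when the clip still sounds vaguely plausible. A toy sketch in pure Python (440 Hz test tone; the linear-interpolation resampler is illustrative only, not production quality):

```python
import math

def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation. No anti-alias filter, so this is
    only adequate for a low-frequency test tone, not real speech pipelines."""
    ratio = src_rate / dst_rate
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a + (b - a) * frac)
    return out

def dominant_freq_hz(samples, rate):
    """Crude frequency estimate from positive-going zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate / len(samples)

src_rate, dst_rate, tone = 44100, 16000, 440.0
sine = [math.sin(2 * math.pi * tone * n / src_rate) for n in range(src_rate)]

good = resample_linear(sine, src_rate, dst_rate)
naive = sine[::2]  # keeping every other sample yields 22.05 kHz, not 16 kHz

print(round(dominant_freq_hz(good, dst_rate)))   # stays close to 440 Hz
print(round(dominant_freq_hz(naive, dst_rate)))  # drops well below 440 Hz
```

The naively decimated signal, if labelled 16 kHz, plays back pitch-shifted — exactly the kind of subtle corruption a model notices even when a human listener shrugs it off.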
I have no idea what the default setting is in ffmpeg, but dithering can be specified in the aresample filter. I played about with some of the options listed here, including dither_method, resampler, precision, filter_type and output_sample_bits, but the transcription either stayed the same or got worse.
Is this relevant for the latest 0.7.0 release as well? I see that client.convert_samplerate does the downsampling using the sox library. With 44.1 kHz input (and DeepSpeech 0.7.0) I still see inference in pretty bad shape. Is downsampling not recommended at all? I understand upsampling from 8 kHz will produce erratic transcription results, but shouldn’t downsampling give results comparable to native 16 kHz input?
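For anyone wanting to reproduce the client’s approach outside of it: convert_samplerate shells out to sox. A rough sketch of that idea, building (not executing) a sox command line that targets 16 kHz mono 16-bit signed little-endian raw PCM; these are standard sox flags, but they may not match the client’s exact invocation:

```python
import shlex

def build_sox_cmd(audio_path, desired_rate=16000):
    """Build a sox command line converting a WAV file to the raw PCM
    format DeepSpeech expects (sketch, not the client's exact flags)."""
    return (
        f"sox {shlex.quote(audio_path)} --type raw --bits 16 --channels 1 "
        f"--rate {desired_rate} --encoding signed-integer --endian little -"
    )

print(build_sox_cmd("input_44100.wav"))
```

Running the resulting command (via subprocess, with sox installed) writes the converted raw audio to stdout, which is how the client feeds it to the model.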
I am currently having this issue with 0.7.0. I have been importing 44.1 kHz files (usually .mkv or .avi) and then down-sampling them to 16 kHz in Audacity (which I think uses ffmpeg) before exporting to .wav. The results are middling to poor without down-sampling, but with down-sampling they’re nonsense.
I don’t know how much of this is related to the importing and exporting itself, as the quality suffers more with down-sampling.