I’m currently working with movie clip data and would like to convert the audio dialogue to text. All clips are in English, so I’m hoping I can just use the pre-trained DeepSpeech model.
However, I’m currently getting gibberish when I run my audio clips through the model. My concern is that the downsampling I’ve used is causing information loss or corruption (even though the clip sounds fine when played back).
Can someone suggest the correct way to downsample, please? I’m guessing someone must have solved this already, but I couldn’t find anything conclusive on the forums.
This is my workflow:
- The pre-trained model is 0.5.0.
- Extract the audio from the video file using ffmpeg: `ffmpeg -i original.avi -ab 160k -ac 1 -ar 16000 -vn audio.wav`. The clips are at 44.1 kHz before extraction and 16 kHz after.
- Run inference on the downsampled file using: `deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio audio_down.wav`
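For completeness, the downsampling step itself was a sox resample along these lines (I’m reconstructing the exact flags here, so treat this as a sketch rather than my literal command history):

```sh
# Resample the extracted audio to what DeepSpeech 0.5.0 expects:
# 16 kHz sample rate (-r 16000), mono (-c 1), 16-bit signed PCM (-b 16).
sox audio.wav -r 16000 -c 1 -b 16 audio_down.wav
```

Alternatively, the ffmpeg call above should already do the conversion in one pass; adding `-acodec pcm_s16le` would pin the output to 16-bit signed PCM explicitly (as far as I know, the `-ab` bitrate flag is ignored for uncompressed WAV output anyway).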
This is the `soxi` output for the audio file before downsampling:

    Input File     : 'audio.wav'
    Channels       : 2
    Sample Rate    : 44100
    Precision      : 16-bit
    Duration       : 00:01:48.51 = 4785408 samples = 8138.45 CDDA sectors
    File Size      : 19.1M
    Bit Rate       : 1.41M
    Sample Encoding: 16-bit Signed Integer PCM
and this is the `soxi` output after downsampling:

    Input File     : 'audio_down.wav'
    Channels       : 1
    Sample Rate    : 16000
    Precision      : 16-bit
    Duration       : 00:01:48.51 = 1736202 samples ~ 8138.45 CDDA sectors
    File Size      : 3.47M
    Bit Rate       : 256k
    Sample Encoding: 16-bit Signed Integer PCM
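On paper the downsampled file matches what DeepSpeech 0.5.0 expects (16 kHz, mono, 16-bit signed PCM), so the format itself looks right. As a sanity check to rule out the model setup, I’m also planning to run inference on one of the sample clips shipped with the release; the path below assumes the `audio-0.5.0.tar.gz` archive from the release page, so adjust it if yours differs:

```sh
# Known-good 16 kHz mono sample from the release archive; if this
# transcribes cleanly, the model files are fine and the problem is
# in my resampled audio rather than in DeepSpeech itself.
deepspeech --model models/output_graph.pbmm \
           --alphabet models/alphabet.txt \
           --lm models/lm.binary --trie models/trie \
           --audio audio/2830-3980-0043.wav
```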
Thanks for your help!