DeepSpeech - Worth Retraining for Low Quality Audio?

I am running a research project that requires transcribing audio from telephone calls, and I have two main questions:

First, would training a model on GSM- and u-law-encoded audio actually improve performance? We’ve had mixed results with the pretrained model that ships with DeepSpeech, but we are also on a time crunch and need to make the most of our time.

Second, if I wanted to train on the Common Voice audio with half of it transcoded to GSM and u-law, what would be the best way to do that? Can I do it through the import_cv2.py script? We want to avoid transcoding to GSM/u-law wav and then going back to mp3.
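For context, the approach I have in mind is a separate post-processing pass over the 16 kHz wav clips that import_cv2.py already produces, rather than touching the mp3s at all. A rough sketch of what that could look like (assuming sox is installed with GSM support; the directory names below are just placeholders):

```python
# Sketch only, not part of import_cv2.py: push already-imported wav clips
# through a GSM or u-law round trip so the training audio mimics telephone
# capture, without ever going back to mp3.
import subprocess
from pathlib import Path

# sox encoding flags for the two telephone codecs.
CODECS = {"gsm": ["-e", "gsm-full-rate"], "ulaw": ["-e", "mu-law", "-b", "8"]}

def degrade_clip(src_wav: Path, dst_wav: Path, codec: str = "gsm") -> None:
    tmp = dst_wav.with_suffix(".codec.wav")
    # Downmix to mono, drop to 8 kHz and apply the lossy telephone codec.
    subprocess.run(
        ["sox", str(src_wav), "-r", "8000", "-c", "1", *CODECS[codec], str(tmp)],
        check=True,
    )
    # Decode back to 16 kHz / 16-bit PCM wav so the training scripts can read it.
    subprocess.run(
        ["sox", str(tmp), "-r", "16000", "-e", "signed-integer", "-b", "16", str(dst_wav)],
        check=True,
    )
    tmp.unlink()

if __name__ == "__main__":
    out_dir = Path("clips_degraded")
    out_dir.mkdir(exist_ok=True)
    clips = sorted(Path("clips_wav").glob("*.wav"))
    # Degrade half the clips, alternating between the two codecs.
    for i, wav in enumerate(clips[: len(clips) // 2]):
        degrade_clip(wav, out_dir / wav.name, codec="gsm" if i % 2 == 0 else "ulaw")
```

That would keep the original mp3-to-wav import untouched and only add one extra lossy step, but I don’t know if it is the recommended way, hence the question.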

I apologise for any mistakes or oversights in my post. I’m not primarily an ML guy, so this work is a little outside my wheelhouse.

Thanks for your help

That’s a wide range of audio quality I guess …

You want to take the MP3s from Common Voice and encode them to GSM and u-law? I fear you’re going to introduce a lot of artifacts.

It’d be great if you could start from there, sharing your results and some information about your captured audio. Maybe there are already improvements that can be made at that stage.

Thank you for the response, and I apologize for my delayed reply. I am unable to share direct transcripts due to data protection rules, but I ran the prepackaged model on a few sample phone calls. The accuracy is about 60%, with key words visible but lots of minor errors. A human could figure out what the calls mean from these rough transcripts, but machine categorization may be difficult.
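For anyone wanting to reproduce that kind of figure, word error rate (WER) is the usual way to make it precise. A minimal sketch, assuming the jiwer package (the strings below are placeholders, not real call data):

```python
# Compare a reference transcript against the model output and report WER.
from jiwer import wer

reference = "please confirm the account number before the transfer"
hypothesis = "please confirm the count number for the transfer"

error = wer(reference, hypothesis)
print(f"WER: {error:.2f}")  # fraction of word-level errors; accuracy is roughly 1 - WER
```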

Please note that I am letting DeepSpeech upsample the audio from 8 kHz to 16 kHz automatically.

Are there any useful samples you could extract without personal information? But I guess 60% already gives a good ballpark.

It’s not unlikely that most of the poor quality comes from the upsampling. From experience, we have seen similar outcomes.
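One thing worth trying is resampling the recordings yourself before feeding them to the client, so you control that step instead of relying on the on-the-fly conversion. A minimal sketch, assuming sox is installed (the directory names are placeholders):

```python
# Explicitly resample 8 kHz call recordings to the 16 kHz / 16-bit PCM
# format the pretrained DeepSpeech model expects.
import subprocess
from pathlib import Path

out_dir = Path("calls_16k")
out_dir.mkdir(exist_ok=True)

for src in Path("calls_8k").glob("*.wav"):
    dst = out_dir / src.name
    # -r 16000 sets the target sample rate; -e/-b pin the output to 16-bit PCM.
    subprocess.run(
        ["sox", str(src), "-r", "16000", "-e", "signed-integer", "-b", "16", str(dst)],
        check=True,
    )
```

Comparing transcripts from those files against the ones you already have would at least tell you how much of the error is coming from the resampling step.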