Retraining for poorer-quality audio

We are evaluating DeepSpeech for a call-center project. We have lots of audio to transcribe: it's relatively poor quality, but our accuracy requirements aren't high. "Poor quality" means 8 kHz recordings, often compressed with G.711, G.729, or Speex. A WER of 30%, or even 40-50%, would be acceptable for this application: open-vocabulary American English conversation.

Out-of-the-box results with 0.3.0 seem OK. Accuracy on our upsampled audio isn't good or even useful yet, maybe 60%-ish WER, but it works, and we think that with some effort we could reach useful accuracy. Inference performance with GPUs seems good. We use our own noise-robust VAD to prepare input segments.

Our thinking is to start with the pretrained models (preferably the 0.5 noise-robust models) and retrain with data in our domain. That data could be CommonVoice samples downsampled/transcoded/upsampled to simulate 8 kHz compressed audio, perhaps with noise augmentation, and/or human-transcribed samples of our own audio.
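Roughly, the simulation step we have in mind looks like this. It's a minimal sketch using only Python's standard-library `audioop` and `wave` modules to round-trip a clip through 8 kHz G.711 μ-law; the function name and the assumption of 16 kHz mono 16-bit input are ours, and G.729 or Speex transcoding would need an external tool such as ffmpeg instead:

```python
import audioop
import wave

def simulate_g711(src_path: str, dst_path: str) -> None:
    """Round-trip a 16 kHz mono 16-bit WAV through 8 kHz G.711 mu-law
    and back, as a rough simulation of call-center channel conditions."""
    with wave.open(src_path, "rb") as src:
        assert src.getframerate() == 16000 and src.getnchannels() == 1
        pcm = src.readframes(src.getnframes())

    # Downsample 16 kHz -> 8 kHz (2-byte samples, mono).
    down, _ = audioop.ratecv(pcm, 2, 1, 16000, 8000, None)
    # G.711 mu-law encode/decode: the lossy step.
    decoded = audioop.ulaw2lin(audioop.lin2ulaw(down, 2), 2)
    # Upsample back to 16 kHz for the 16 kHz acoustic model.
    up, _ = audioop.ratecv(decoded, 2, 1, 8000, 16000, None)

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(16000)
        dst.writeframes(up)
```

Noise augmentation would be layered on top of this before or after the round-trip.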

How well do you think such retraining will work with varying amounts of data (say 100, 200, or 500 hours)? How would the result compare to training from scratch on in-domain data? Does anyone have references for comparable retraining projects? We're trying to get a sense of the effort involved and the prospects for success.

Maybe a long shot, but I remember some people dealing with air traffic control data, which could be close in terms of poor quality, and in their case simply applying some low-pass filtering helped a lot. Have you tried that kind of cleanup? That might be an easier way to get better WER than retraining from scratch, which might be tricky to get working with a few hundred hours.
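Something like this is what I mean, as a sketch; the 3400 Hz cutoff is just a guess at the telephony band edge, the helper name is mine, and SoX needs to be on PATH:

```python
import subprocess

def lowpass_cleanup(src_wav: str, dst_wav: str, cutoff_hz: int = 3400) -> None:
    """Apply a steep sinc low-pass with SoX to strip energy above the
    telephony band before inference."""
    # In SoX's sinc effect, a leading "-" before the frequency means low-pass.
    subprocess.run(["sox", src_wav, dst_wav, "sinc", f"-{cutoff_hz}"],
                   check=True)
```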

Your models were trained with Switchboard and Fisher (plus Common Voice and others), right? Those corpora are 8 kHz μ-law encoded, so they should be a plausible match for our data. However, it would be best to match the upsampling algorithm/filter, and I'm not sure what that was. It looks like the Fisher import script used Python's audioop.ratecv, which upsamples with a rather low-quality single-pole filter. I don't see where the Switchboard import script did any upsampling, so my guess is that it happened behind the scenes using SoX, which has a high-quality filter. Does that sound right?
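For concreteness, these are the two upsampling paths I'd try to match; a sketch, with function names mine and the "what the imports did" part being my reading of the scripts, not confirmed:

```python
import audioop
import subprocess

def upsample_like_fisher(pcm_8k: bytes) -> bytes:
    """Upsample 8 kHz 16-bit mono PCM to 16 kHz the way the Fisher import
    script appears to: audioop.ratecv with its simple built-in filter."""
    pcm_16k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, None)
    return pcm_16k

def upsample_like_sox(src_wav: str, dst_wav: str) -> None:
    """Upsample with SoX's high-quality default resampler instead, which is
    presumably what a behind-the-scenes Switchboard conversion would use."""
    subprocess.run(["sox", src_wav, "-r", "16000", dst_wav], check=True)
```

If the pretrained models really saw a mix of both, matching either one at inference time might be good enough.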

I tried a couple of different low-pass and band-pass filters (sinc and IIR, with 3200 and 3600 Hz cutoffs), and it doesn't make much difference. Better upsampling algorithms seem to have more effect.
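For reference, the variants I compared were roughly these SoX effects (a sketch reconstructed from memory; file names are placeholders):

```python
import subprocess

# Filter variants compared (SoX effects; cutoffs in Hz).
VARIANTS = {
    "sinc_lp_3200": ["sinc", "-3200"],          # steep FIR low-pass
    "sinc_lp_3600": ["sinc", "-3600"],
    "iir_lp_3200":  ["lowpass", "-2", "3200"],  # two-pole IIR low-pass
    "iir_lp_3600":  ["lowpass", "-2", "3600"],
    "sinc_bp":      ["sinc", "300-3600"],       # telephony-ish band-pass
}

for name, effect in VARIANTS.items():
    subprocess.run(["sox", "clip.wav", f"clip_{name}.wav", *effect],
                   check=True)
```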

When you say “tricky to get working with a few hundred hours”, do you mean that’s not enough data to adapt the pretrained models to our domain?