Need help with audio cleaning/de-noising

Hello everyone,

I need help/ideas on what I can use to de-noise audio so I can improve the model accuracy. I have already tried RNNoise: the model output gets better for some audio files and worse for others, so it is not a net help. Can you suggest anything else I could try? It’ll be really helpful.

Following are the details I’m working with:
Acoustic Model : 0.7.1 released by Mozilla
Scorer : Custom
Data : Conversational data. Customer support.

I am splitting each audio call into smaller chunks using VAD, and those chunks are then fed to the model. What I’ve observed is that the model does well when the audio is short. So I am hoping that de-noising will help more with the longer audio files, but I am not sure where to start. Thanks!
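For anyone following along, the chunking step described above could look roughly like the sketch below. It is not the poster's code: it uses a plain energy threshold as a stand-in for a real VAD, and every name and default in it (split_on_silence, rms_threshold, max_chunk_s and so on) is a hypothetical choice for illustration. It assumes 16-bit mono PCM samples held in a list, and that max_chunk_s is longer than one frame.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_on_silence(samples, sample_rate=16000, frame_ms=30,
                     rms_threshold=500.0, max_chunk_s=10.0):
    """Return (start, end) sample indices of speech chunks.

    A frame counts as speech when its RMS exceeds rms_threshold
    (an assumed value; tune it on your own recordings). Consecutive
    speech frames are merged, and a chunk is closed early when one
    more frame would push it past max_chunk_s.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    max_len = int(sample_rate * max_chunk_s)
    chunks = []
    start = None
    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        is_speech = frame_rms(frame) > rms_threshold
        if is_speech and start is None:
            start = pos
        if start is not None:
            end = pos + len(frame)
            if not is_speech:
                # Silence closes the current chunk just before this frame.
                chunks.append((start, pos))
                start = None
            elif end - start + frame_len > max_len:
                # Cap the duration so long speech runs are still split.
                chunks.append((start, end))
                start = None
    if start is not None:
        chunks.append((start, len(samples)))
    return chunks
```

Capping the chunk length is only there because shorter chunks were reported to transcribe better; a hard cut can land mid-word, so a real pipeline would probably prefer to cut at the quietest frame near the limit.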

Bad data will lead to bad results, not much you can do about it.

But are you using the standard English release to recognize phone calls? That’s usually not too good.

Just an idea, but you could denoise all your training data as a sort of data augmentation.
Then you retrain with both the original and the denoised data.
You should apply the same denoising to all new incoming data.
This trains your model to be robust to both original and cleaned data.
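A minimal sketch of the bookkeeping for this idea, assuming the training set is described by a DeepSpeech-style CSV with the columns wav_filename, wav_filesize and transcript, and that the denoised copies were already written to a separate directory under the same file names. The function and directory names are made up for the example; the denoising itself is not shown.

```python
import csv
import os

FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def denoised_path(wav_path, denoised_dir):
    """Assumed layout: the denoised copy keeps the original file name."""
    return os.path.join(denoised_dir, os.path.basename(wav_path))


def combine_manifests(original_csv, denoised_dir, combined_csv):
    """Write a CSV listing every original clip plus its denoised copy.

    Clips without a denoised copy on disk are kept as originals only.
    Returns the number of rows written.
    """
    rows = []
    with open(original_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rows.append({k: row[k] for k in FIELDS})
            clean = denoised_path(row["wav_filename"], denoised_dir)
            if os.path.exists(clean):
                rows.append({
                    "wav_filename": clean,
                    "wav_filesize": os.path.getsize(clean),
                    "transcript": row["transcript"],  # same label, cleaner audio
                })
    with open(combined_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The combined CSV can then be used in place of the original training CSV, so the model sees each utterance both raw and denoised.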

If you already have enough labelled noisy data, you could also try training on the noisy data alone, without any denoising. The network should figure it out by itself, but that always depends on the amount of training data.



Hello, yes, I am using the standard English release. What I am trying to figure out is whether de-noising a longer chunk helps with longer audio inputs. What I am seeing is that denoising corrects some words but also causes a few originally correct words to be predicted wrongly, so the overall gain in accuracy is diminished.

Thanks. That looks like something I can try. I am also thinking of collecting the longer audio chunks (the ones that aren’t doing well) and training the model released by Mozilla on them.

How long are your longer chunks on average?

For inference, I filter my input data. I experimented with denoising (RNNoise) and found it slow for the little improvement it gave. It does improve accuracy, but normalizing gave the best improvement. After that, high-pass and low-pass filters are nearly cost-free and showed very slight improvements as well. The order of filtering also has an effect.
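To make the filtering idea concrete, here is a dependency-free sketch of peak normalization followed by first-order high-pass and low-pass filters. It is not the poster's code: the cutoff frequencies, the target peak and the order of the steps are all placeholder assumptions, and samples are assumed to be floats in the range -1 to 1. Since the order reportedly matters, the chain is passed in as a list so different orders can be tried.

```python
import math


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def low_pass(samples, cutoff_hz, sample_rate):
    """First-order (one-pole RC) low-pass filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for s in samples:
        prev = prev + alpha * (s - prev)
        out.append(prev)
    return out


def high_pass(samples, cutoff_hz, sample_rate):
    """First-order (one-pole RC) high-pass filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_out, prev_in = [], 0.0, 0.0
    for s in samples:
        prev_out = alpha * (prev_out + s - prev_in)
        prev_in = s
        out.append(prev_out)
    return out


def apply_chain(samples, steps):
    """Apply each step in order; reorder the list to compare orderings."""
    for step in steps:
        samples = step(samples)
    return samples


def preprocess(samples, sample_rate=16000,
               highpass_hz=100.0, lowpass_hz=3400.0):
    """One possible order (an assumption): normalize, high-pass, low-pass."""
    steps = [
        peak_normalize,
        lambda x: high_pass(x, highpass_hz, sample_rate),
        lambda x: low_pass(x, lowpass_hz, sample_rate),
    ]
    return apply_chain(samples, steps)
```

First-order filters roll off gently, so a real setup would more likely use steeper filters from an audio tool or DSP library; the point here is only how cheap these steps are compared with a neural denoiser.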

Hi, >10 sec. The results I get are not totally wrong. They look like this:

The words enclosed in [ ] were predicted wrongly; the correct word follows each bracket.

[digit]try pushing any buttons on the side and see if for any light comes up on the [keep or]keypad if you hear a beep [on]tone out of it. yes there is a [bed]beep. i have light flashing.

There are other cases as well, where the words don’t make sense. But those are mostly longer audios. In shorter audios, generally it makes good sense when you read the prediction.
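Since denoising fixes some words and breaks others, it helps to put a number on each chunk rather than eyeballing the text. Below is a small sketch of word error rate (WER) computed as a word-level edit distance; the function name is made up, and it assumes transcripts are already lower-cased and stripped of punctuation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Counts substitutions, insertions and deletions, each with cost 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Fragment from the example above: "beep" was recognised as "bed".
print(word_error_rate("yes there is a beep", "yes there is a bed"))
```

Running this on the raw and the denoised prediction of the same chunk, against the same reference, shows per chunk whether denoising was a net win, and averaging by chunk duration would show whether the longer chunks are really the ones that benefit.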

Hi, thanks for sharing. Can you share your code snippet, if possible?

I was surprised by how much better my WER got after I tuned my microphone. Make sure it isn’t getting distorted by being set too high. I’m on Ubuntu, and I played around with the PulseEffects WebRTC settings. Those can really help.

(uses and
I ended up running through maybe 1/4 of the total possible combinations to see what worked best for me.


Thanks! I am using recorded audio right now, but I will keep this in mind when I switch to streaming.

I’ll check this out and will also post the approach that helped me the most here.

Hello all. For anyone working on audio denoising/noise reduction, please check this out. It is a denoiser released by Facebook.

After the environment is set up, this is the command that I used:

python -m denoiser.enhance --dns64 --noisy_dir=<path to the dir with the noisy files> --out_dir=<path to store enhanced files>

I tried it out and the noise reduction is more prominent than with RNNoise, at least on the data I tested it on. Hope it helps!

Yes, fbr-denoise is very good, but it also demands much more CPU than RNNoise.