Noise reduction component in the pre-transcription processing


I’m thinking about implementing a noise reduction component before the actual transcription stage. I’m curious whether this will improve the accuracy, since the audio will be cleaner. Has anyone already done this?
Did you implement something like a low-pass filter or similar? And how did it impact the transcription?

Is this considered good practice in speech recognition, or rather not?

Thank you in advance!

I did a quick test with RNNoise. It improved the result a bit, but the effect still depends heavily on the input signal.


I ran tests on my own data using filters in various combinations. The most effective on its own was normalization. High-pass and low-pass filters also showed a small but consistent improvement. RNNoise showed improvement, but less consistently, and it also added by far the most latency of the four.


Thanks guys @baconator and @dkreutz for your answers!

I was wondering which libraries you used, @baconator, for the normalization and low/high-pass filters. I’d be glad if you could tell me. Thank you in advance!

I was trying pydub.
Normalization slightly improved the audio quality and therefore the transcription.
However, I’m not sure how high the cutoff should be for low-pass and high-pass filtering. What values did you specify? I have 16kHz sample rate audio, and I suppose you do too (unless you convert it to 8kHz for telephony…). Can you tell me which cutoff values you used, @baconator?

You can safely do a low-cut/high-pass at 50Hz (the average human voice cannot get that low) or even a little higher at 75Hz. That will remove rumble noise (from mic handling etc.).
A 16kHz sample rate gives you 8000Hz as the highest representable frequency (see the Nyquist–Shannon sampling theorem), so that is the upper limit for a high-cut/low-pass. Depending on your use case you can try lowering it to 7000Hz.

Regarding the filter slope, try 6dB/octave or 12dB/octave; higher values might result in an undesired boost around the cut-off frequency.
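For anyone wondering what a 6dB/octave slope corresponds to: it’s what you get from a first-order (one-pole) RC filter. Here’s a minimal pure-Python sketch, just to illustrate the idea (function names and cut-off values are my own, not from any tool mentioned above):

```python
import math

def high_pass(samples, sample_rate, cutoff):
    """First-order RC high-pass filter (~6 dB/octave roll-off below cutoff)."""
    rc = 1.0 / (2 * math.pi * cutoff)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out

def low_pass(samples, sample_rate, cutoff):
    """First-order RC low-pass filter (~6 dB/octave roll-off above cutoff)."""
    rc = 1.0 / (2 * math.pi * cutoff)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out = [samples[0] * alpha]
    for i in range(1, len(samples)):
        out.append(out[-1] + alpha * (samples[i] - out[-1]))
    return out
```

Applying the same first-order filter twice steepens the slope to roughly 12dB/octave, which matches the second value suggested above.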


Yeah, pydub/numpy. I cut off at 100Hz and 3000Hz.
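For anyone following along, that pydub chain might look like the sketch below. I generate a sine tone so the snippet is self-contained; in practice you would load your recording with `AudioSegment.from_wav(...)` instead (the file name in the comment is hypothetical):

```python
from pydub.generators import Sine
from pydub.effects import normalize

# Stand-in input: 1 second of a 440 Hz tone at 16 kHz
# (replace with AudioSegment.from_wav("input.wav") for real audio)
seg = Sine(440, sample_rate=16000).to_audio_segment(duration=1000)

# pydub's built-in 6 dB/octave RC filters, with the cut-offs mentioned above
filtered = seg.high_pass_filter(100).low_pass_filter(3000)

# normalization as the final step
cleaned = normalize(filtered)
# cleaned.export("cleaned.wav", format="wav")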


Hi @dkreutz,
thank you for the answer. What do you mean by that? Where can I specify these values?

That depends on the audio tool you are using for filtering. The highpass/lowpass filters of sox have a default slope of 6dB/octave.


@dkreutz thanks! I will check out sox.
Until now I have used pydub, and it works well for normalization.
But I’m still struggling with high/low-pass filtering.
In pydub the signal is reduced by 6dB per octave below the cutoff point (high-pass) or above the cutoff point (low-pass). However, I still hear the noise in the audio saved after filtering.

Cutoff range: 100Hz–3000Hz

During filtering I first apply the low-pass and high-pass filters, and after that I do normalization. I hope this order is fine.

pydub should be fine for filtering. Regarding the processing order: it doesn’t matter whether you do the low-pass or high-pass first, but normalizing should be the last step.
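As a side note on why normalization goes last: a peak normalizer scales to the loudest remaining sample, so removing rumble first lets it use more headroom. A tiny sketch of peak normalization in plain Python (the function name and the 0.9 target are illustrative, not from any library above):

```python
def peak_normalize(samples, target=0.9):
    """Scale samples so the largest absolute value reaches `target`."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target / peak
    return [s * gain for s in samples]
```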


Thanks for sharing this amazing information.

I have been using this with RNNoise added to the preprocessing pipeline and it works pretty well for me.
I can provide initial improvement results on real-world audio meeting data.

Thanks for sharing your approaches. It motivated me to run a benchmark for comparison:

In my experiments frequency filtering alone had only a very small impact. Noise reduction (with RNNoise) helped considerably, but it can also lower the accuracy in quieter environments.

Note that the benchmark does not test transcription accuracy directly, because I do an additional step afterwards (Speech → Text → Intent+Slots).

The benchmark code can be found here

Update: the new model version (0.9) has much better accuracy in noisy environments due to the noise augmentation used in training. Extra noise reduction now decreases the accuracy, while frequency filtering still increases it a little in very noisy environments.


I’ve been looking for a place to share this:

I recently found out about it, and didn’t want the information to get lost in case it can be useful, or anyone thinks this deserves evaluation.

The project was funded by the Prototype Fund, a funding program of the Federal Ministry of Education and Research (BMBF) that is managed and evaluated by the Open Knowledge Foundation Germany.
See:

To give a little more context, I only discovered Common Voice a couple of days ago, and I don’t know about the technical aspects of voice recognition, AI, ML or noise filtering. My mind just went “wait, there’s this vaguely related project you stumbled upon recently, maybe that could help”.

So, here it is :slightly_smiling_face:


Thanks @rouelibre

Looks like one of the authors of Noize (Nolze?) forked it and extended it with several other sound features: