I need ideas for denoising audio so I can improve model accuracy. I have already tried RNNoise: the model output gets better for some audios and worse for others (which isn't helping). Can anyone suggest anything else I can try? It would be really helpful.
Following are the details I’m working with:
Acoustic Model : 0.7.1 released by Mozilla
Scorer : Custom
Data : Conversational data. Customer support.
I split each call into smaller chunks using VAD, and those chunks are then fed to the model. What I've observed is that the model does well when the audio is short. So I'm hoping that denoising will help with the longer audio files, but I'm not sure where to start. Thanks!
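For reference, the chunking step can be sketched with a crude energy-gate VAD. This is a stand-in for a real VAD (e.g. webrtcvad); the frame length, energy threshold, and silence span below are illustrative values, not the poster's actual settings:

```python
import numpy as np

def split_on_silence(samples, sample_rate=16000, frame_ms=30,
                     energy_threshold=0.01, min_silence_frames=10):
    """Split a mono float waveform into speech chunks using a crude
    per-frame RMS energy gate (illustrative stand-in for a real VAD)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    chunks, current, silent_run = [], [], 0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms >= energy_threshold:
            current.append(frame)       # speech frame: keep building chunk
            silent_run = 0
        elif current:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # enough consecutive silence: close the current chunk
                chunks.append(np.concatenate(current))
                current, silent_run = [], 0
            else:
                current.append(frame)   # short pause inside an utterance
    if current:
        chunks.append(np.concatenate(current))
    return chunks
```

Each returned chunk can then be fed to the model independently, which is what keeps the per-inference audio duration short.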
Just an idea, but you could denoise all your training data as a sort of data augmentation.
Then retrain with both the original and the denoised data.
You should do the same with all new incoming data.
This trains your model to be robust to both the original and the cleaned audio.
If you already have enough labelled noisy data, you could also try training on the noisy data without any denoising. The network should figure it out on its own; it's always a matter of how much training data you have.
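The retraining recipe above could be wired up as a simple manifest-doubling step. This is only a sketch: `build_augmented_manifest` and the `_denoised` filename convention are made up for illustration, and the actual denoising pass (e.g. running RNNoise offline over the files) is assumed to have happened separately:

```python
def build_augmented_manifest(rows, denoised_suffix="_denoised"):
    """Given (wav_path, transcript) rows, return the rows doubled with
    their denoised copies, assuming the denoised files sit next to the
    originals with a suffix (a hypothetical naming convention)."""
    out = []
    for wav, text in rows:
        out.append((wav, text))                      # original noisy audio
        stem, _, ext = wav.rpartition(".")
        out.append((f"{stem}{denoised_suffix}.{ext}", text))  # denoised copy, same label
    return out
```

The doubled list can then be written out as the training CSV, so the model sees every utterance in both its original and cleaned form.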
Hello, yes, I am using the standard English release. What I am trying to figure out is whether denoising a longer chunk can help with longer audio inputs. What I'm seeing is that denoising corrects some words, but it also causes a few originally correct words to be predicted wrongly, so the overall gain in accuracy is diminished.
Thanks, that looks like something I can try. I'm also thinking of taking the longer audio chunks (the ones that aren't doing well) and continuing to train the Mozilla-released model on those.
For inference, I filter my input data. I experimented with denoising (RNNoise) and found it slow and not much of an improvement. It does help a little, but normalizing gave the biggest accuracy improvement. After that, high-pass and low-pass filters are nearly cost-free and showed very slight improvements as well. The order of filtering also has an effect.
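As a sketch of that ordering (normalize first, then band-limit) using SciPy Butterworth filters; the cutoff frequencies and filter order here are illustrative guesses, not the values from the post:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preprocess(samples, sample_rate=16000, hp_hz=100.0, lp_hz=7000.0):
    """Peak-normalize, then band-limit to roughly the speech range.
    Cutoffs and filter order are illustrative, not tuned values."""
    peak = np.max(np.abs(samples))
    if peak > 0:
        samples = samples / peak          # normalize first (biggest win per the post)
    sos_hp = butter(4, hp_hz, btype="highpass", fs=sample_rate, output="sos")
    sos_lp = butter(4, lp_hz, btype="lowpass", fs=sample_rate, output="sos")
    samples = sosfilt(sos_hp, samples)    # strip low-frequency rumble/hum
    samples = sosfilt(sos_lp, samples)    # strip high-frequency hiss
    return samples
```

Swapping the normalize and filter steps changes the result (filtering shifts the peak), which is one way the ordering effect shows up.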
Hi, the chunks are >10 sec. And the results I get are not totally wrong. It's like this:
The words enclosed in [ ] are the wrong predictions; the correct word is shown right after each bracketed word.
[digit]try pushing any buttons on the side and see if for any light comes up on the [keep or]keypad if you hear a beep [on]tone out of it. yes there is a [bed]beep. i have light flashing.
There are other cases as well, where the words don't make sense, but those are mostly the longer audios. For shorter audios, the prediction generally makes good sense when you read it.
I was surprised by how much better my WER got after I tuned my microphone. Make sure it isn't getting distorted by being set too high. I'm on Ubuntu, and I played around with the PulseEffects WebRTC settings; those can really help.