Hi! First of all, I want to express my great thanks to the Mozilla DeepSpeech team.
I already got some good results just by following the training guide with my own data. I'm really grateful for everything the team has provided.
My question is:
For a well-trained model, is it possible for the ASR system (or, to be more precise, the acoustic model) to become immune to different audio levels (loudness) when running inference?
I've been tracing the code for a while, and there don't seem to be any audio normalization steps (e.g. peak normalization) at either training or inference time (sorry if I missed it).
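Just so we're talking about the same thing, this is the kind of step I mean by peak normalization. It's a generic sketch I wrote for illustration, not something taken from the DeepSpeech code base:

```python
# Generic peak-normalization sketch (my own illustration, not DeepSpeech code).
# Assumes `audio` is a float numpy array with samples roughly in [-1.0, 1.0].
import numpy as np

def peak_normalize(audio, target_peak=1.0):
    """Scale the signal so its maximum absolute sample equals target_peak."""
    peak = np.max(np.abs(audio))
    if peak == 0.0:
        return audio  # all-zero (silent) input, nothing to scale
    return audio * (target_peak / peak)
```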
Is this usually handled by applying volume augmentation (see the sketch below) and leaving the rest to model training (e.g. lowering the weight of the 0th MFCC, as mentioned here), just like the way we make the model more robust to noisy backgrounds?
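By volume augmentation I mean something along these lines. This is a rough sketch under my own assumptions; the function name, gain range, and clipping behaviour are hypothetical, not taken from the DeepSpeech training pipeline:

```python
# Rough sketch of volume (gain) augmentation; the dB range is an assumption.
import numpy as np

def random_gain(audio, min_db=-10.0, max_db=10.0, rng=None):
    """Apply a random gain, drawn uniformly in dB, and clip to [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    gain_db = rng.uniform(min_db, max_db)
    gain = 10.0 ** (gain_db / 20.0)          # dB -> linear amplitude factor
    return np.clip(audio * gain, -1.0, 1.0)  # keep samples in the valid range
```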
The linked post says:
Generally the first MFCC coefficient is obtained by fitting the constant value curve (cos(0)) to your log-energy filter banks. Therefore it is highly correlated with the RMS energy of your signal. If you remove that coefficient (often called ‘static’) then in theory you make your model volume (gain) independent.
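To check my understanding, this is the kind of experiment I have in mind. librosa is used purely for illustration and "sample.wav" is a placeholder clip; I know DeepSpeech computes its features differently internally:

```python
# Quick check of the quoted idea: a pure gain change mostly shifts c0 (the
# 'static' coefficient), while the higher MFCCs stay nearly the same.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)                    # placeholder clip
mfcc_loud  = librosa.feature.mfcc(y=y,       sr=sr, n_mfcc=13)
mfcc_quiet = librosa.feature.mfcc(y=0.1 * y, sr=sr, n_mfcc=13)  # ~20 dB quieter

print(np.mean(np.abs(mfcc_loud[0]  - mfcc_quiet[0])))   # big shift in c0
print(np.mean(np.abs(mfcc_loud[1:] - mfcc_quiet[1:])))  # much smaller change

gain_independent = mfcc_loud[1:]  # dropping c0 removes most of the gain info
```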
I just want to make sure I didn't get it wrong.
Thanks in advance!