Hi! First of all, I want to express my great thanks to the Mozilla DeepSpeech team.
I already got some good results just by following the training guide with my own data. I'm really grateful for everything the team has provided.
My question is:
For a well-trained model, is it possible for the ASR system (or, to be more precise, the acoustic model) to become immune to different audio levels (loudness) when running inference?
I've been tracing the code for a while, and there don't seem to be any audio normalization steps (e.g. peak normalization) at either training or inference time (sorry if I missed it).
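Just so we're talking about the same thing, this is the kind of step I mean by peak normalization. It's a generic sketch I wrote for illustration, not something taken from the DeepSpeech code base:

```python
# Generic peak-normalization sketch (my own illustration, not DeepSpeech code).
# Assumes `audio` is a float numpy array with samples roughly in [-1.0, 1.0].
import numpy as np

def peak_normalize(audio, target_peak=1.0):
    """Scale the signal so its maximum absolute sample equals target_peak."""
    peak = np.max(np.abs(audio))
    if peak == 0.0:
        return audio  # all-zero (silent) input, nothing to scale
    return audio * (target_peak / peak)
```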
Is this usually handled by applying volume augmentation (see the sketch below) and leaving the rest to model training (e.g. lowering the weight of the 0th MFCC, as mentioned here), just like the way we make the model more robust to noisy backgrounds?
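By volume augmentation I mean something along these lines. This is a rough sketch under my own assumptions; the function name, gain range, and clipping behaviour are hypothetical, not taken from the DeepSpeech training pipeline:

```python
# Rough sketch of volume (gain) augmentation; the dB range is an assumption.
import numpy as np

def random_gain(audio, min_db=-10.0, max_db=10.0, rng=None):
    """Apply a random gain, drawn uniformly in dB, and clip to [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    gain_db = rng.uniform(min_db, max_db)
    gain = 10.0 ** (gain_db / 20.0)          # dB -> linear amplitude factor
    return np.clip(audio * gain, -1.0, 1.0)  # keep samples in the valid range
```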
The linked post says:
Generally the first MFCC coefficient is obtained by fitting the constant value curve (cos(0)) to your log-energy filter banks. Therefore it is highly correlated with the RMS energy of your signal. If you remove that coefficient (often called ‘static’) then in theory you make your model volume (gain) independent.
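To check my understanding, this is the kind of experiment I have in mind. librosa is used purely for illustration and "sample.wav" is a placeholder clip; I know DeepSpeech computes its features differently internally:

```python
# Quick check of the quoted idea: a pure gain change mostly shifts c0 (the
# 'static' coefficient), while the higher MFCCs stay nearly the same.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)                    # placeholder clip
mfcc_loud  = librosa.feature.mfcc(y=y,       sr=sr, n_mfcc=13)
mfcc_quiet = librosa.feature.mfcc(y=0.1 * y, sr=sr, n_mfcc=13)  # ~20 dB quieter

print(np.mean(np.abs(mfcc_loud[0]  - mfcc_quiet[0])))   # big shift in c0
print(np.mean(np.abs(mfcc_loud[1:] - mfcc_quiet[1:])))  # much smaller change

gain_independent = mfcc_loud[1:]  # dropping c0 removes most of the gain info
```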
I just want to make sure I didn't get it wrong.
Thanks in advance!