Reasonable values for data augmentation

Hi DeepSpeech experts,

I’m interested in building an ASR model that works well with spontaneous conversation in environments with some background noise.

I understand that the current DeepSpeech pretrained model (0.6.1) is trained with relatively clean audio and no noise added.

In the near term, I might be able to label ~1 hour of data from my domain for transfer learning. I fear that will not be enough for good fine-tuning. Consequently, I would like to train a base model on the Common Voice dataset with data augmentation, including Gaussian noise.

I see there’s already a training parameter that adds Gaussian noise, “data_aug_features_additive”. However, I don’t know what a reasonable value would be. Is there a rule of thumb, a natural default, or a suggestion based on past experience for setting it on a first training run?

A related question: if I train a model and the ASR accuracy on my dataset is somewhat low, how can I tell whether data_aug_features_additive was set too high or too low? I can see either extreme having a negative effect on training…

Thanks in advance for any help you can provide. =D

Aha, one follow-up question. There’s another noise parameter data_aug_features_multiplicative. Should I be using this noise feature as well, and if so is there a recommended value to give?

I honestly don’t have a great intuition for what causes additive versus multiplicative noise, nor which type of noise is more prevalent in my data.
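
For reference, the difference between the two noise types can be sketched in NumPy. This is only an illustration, assuming both flags are the standard deviation of Gaussian noise applied to the feature matrix (additive: zero-mean noise added to each value; multiplicative: each value scaled by a factor drawn around 1). That assumption should be checked against the DeepSpeech feeding code. `augment_features` is a hypothetical helper, not part of DeepSpeech.

```python
import numpy as np


def augment_features(features, additive_std=0.0, multiplicative_std=0.0, rng=None):
    """Apply Gaussian noise to a (time, n_features) matrix.

    Hypothetical helper for illustration only. It assumes both arguments are
    standard deviations, mirroring how the two flags are presumed to behave.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.asarray(features, dtype=np.float64).copy()
    if multiplicative_std > 0:
        # Multiplicative: scale each value by a factor centred on 1, so the
        # perturbation grows with the magnitude of the value itself.
        out *= rng.normal(loc=1.0, scale=multiplicative_std, size=out.shape)
    if additive_std > 0:
        # Additive: the perturbation has the same spread everywhere,
        # regardless of how large or small the underlying value is.
        out += rng.normal(loc=0.0, scale=additive_std, size=out.shape)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for real features: 200 frames x 26 coefficients.
    feats = rng.normal(loc=0.0, scale=5.0, size=(200, 26))
    for std in (0.1, 0.5, 2.0):
        noisy = augment_features(feats, additive_std=std, rng=rng)
        # Ratio of noise spread to feature spread, a rough way to judge
        # whether a candidate value is small or large for this data.
        print(std, np.std(noisy - feats) / np.std(feats))
```

Comparing the noise spread to the spread of the actual features, as in the loop above, is one way to judge whether a candidate value is tiny or overwhelming relative to the data, since the right scale depends on how the features are normalised.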


Hi @craklyn

Is there any good dataset for spontaneous speech, either open or commercial? I think the pretrained model is biased toward read speech.