Hi DeepSpeech experts,
I’m interested in building an ASR model that works well with spontaneous conversation in environments with some background noise.
I understand that the current DeepSpeech pretrained model (0.6.1) was trained on relatively clean audio with no added noise.
In the near term, I might be able to label ~1 hour of data from my domain for transfer learning. I fear that will not be enough for fine-tuning to work well. Consequently, I would like to train a base model on the Common Voice dataset with data augmentation, including Gaussian noise.
I see there’s already a training parameter that adds Gaussian noise, “data_aug_features_additive”. However, I don’t know what a reasonable value would be. Is there a rule of thumb, a natural default, or a suggestion based on past experience for setting it on a first training run?
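To make the question concrete, here is a rough sketch of how I have been trying to reason about candidate values. It assumes (I have not confirmed this) that the flag is the standard deviation of zero-mean Gaussian noise added directly to the feature matrix, and it uses random unit-variance numbers as a stand-in for real MFCC features. The helper name `feature_snr_db` is my own, not something from DeepSpeech.

```python
import numpy as np

def feature_snr_db(features, noise_std, seed=0):
    """Empirical feature-level SNR (dB) after adding zero-mean Gaussian noise.

    Assumption: the augmentation adds N(0, noise_std^2) noise to every
    feature value. This mirrors my reading of the flag, not verified code.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=features.shape)
    # Treat the variance of the features as the "signal" power.
    signal_power = np.mean((features - features.mean()) ** 2)
    noise_power = np.mean(noise ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

# Stand-in for a real (time_steps, n_features) feature matrix.
# Replace with features extracted from actual training audio.
rng = np.random.default_rng(42)
features = rng.normal(0.0, 1.0, size=(500, 26))

for noise_std in (0.05, 0.1, 0.2, 0.5, 1.0):
    snr = feature_snr_db(features, noise_std)
    print(f"stddev={noise_std:<4} -> roughly {snr:.1f} dB feature-level SNR")
```

With real features the scale will differ, so the same stddev could be far milder or harsher than it looks here. That is part of why I am asking what range people have actually used.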
A related question: if I train a model and the ASR accuracy is somewhat low on my dataset, how can I tell whether data_aug_features_additive was set too high or too low? I can see either extreme hurting training…
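The only approach I can think of is a small sweep: train with a few values, then compare word error rate (WER) on a clean held-out set and a noisy one from my domain. Below is a minimal sketch of the WER computation I would use for that comparison (plain word-level edit distance; `word_error_rate` is my own helper, not a DeepSpeech function). My guess, which may be wrong, is that a value that is too high would raise WER on both sets, while a value that is too low would leave the noisy-set WER well above the clean-set WER.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    # Guard against an empty reference to avoid division by zero.
    return prev[-1] / max(len(ref), 1)

# Made-up transcripts purely for illustration.
print(word_error_rate("turn the lights off", "turn the light off"))
```

Averaging this over each held-out set for every value in the sweep would give one clean-set and one noisy-set number per run to compare.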
Thanks in advance for any help you can provide. =D