I am working on training/fine-tuning DeepSpeech 0.7.0 (CUDA 10.0, cuDNN 7.6.5, TensorFlow 1.15.2, Ubuntu 16.04 with Python 3.6.7) on a machine with two RTX 2080 Ti 11GB GPUs.
I have 2000+ hours of conversational data (mostly American English). Because it is conversational, repeated words are sometimes missing from the transcript, as are fillers like “you know”, “like”, “uh”, “um”, etc. My question: do all of these need to be captured exactly in the training and validation sets to expect a low enough WER? In other words, does the training data have to be clean enough that the transcript matches the audio 100% for the system to perform well?
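Just to illustrate what I mean, this is roughly the kind of normalisation I could apply to the transcripts, either keeping or dropping the fillers (only a sketch; the filler list is a placeholder, and multi-word fillers like “you know” would need phrase handling):

```python
# Sketch only: lowercase, strip punctuation, optionally drop filler words.
# FILLERS is a placeholder list, not anything from the DeepSpeech importers.
import re

FILLERS = {"uh", "um"}

def normalize(transcript, keep_fillers=True):
    """Lowercase, strip punctuation, optionally drop single-word fillers."""
    text = re.sub(r"[^a-z' ]", " ", transcript.lower())
    words = text.split()
    if not keep_fillers:
        words = [w for w in words if w not in FILLERS]
    return " ".join(words)

print(normalize("Uh, I was like, you know, going there."))
# -> "uh i was like you know going there"
print(normalize("Uh, I was like, you know, going there.", keep_fillers=False))
# -> "i was like you know going there"
```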
Thanks
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
By definition of machine learning, I fear the answer is yes.
Also, if a recording contains clapping or laughter, would it be better to exclude those clips, or to include them and tag them in the transcript accordingly? Given that I don’t expect to see laughter or clapping in my real data, would it be best to exclude them from the training and validation sets?
Thanks a lot.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
As long as the audio is intelligible, noise is good to keep: it helps make your model robust against it.
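For example (just a sketch, not part of DeepSpeech; the `[laughter]` / `[clapping]` tag syntax, file names, and CSV layout are assumptions on my side), you could keep the noisy clips but strip the event tags from the transcripts before building the training CSV, so the labels only contain spoken words:

```python
# Sketch only: keep clips that contain laughter / clapping, but remove the
# event tags from the transcript text so the label matches the spoken words.
# Assumes a DeepSpeech-style CSV with a "transcript" column.
import csv
import re

EVENT_TAG = re.compile(r"\[(laughter|clapping)\]")

with open("train_tagged.csv", newline="") as src, \
     open("train_for_deepspeech.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Drop the tags, collapse whitespace, keep the (noisy) audio clip.
        cleaned = EVENT_TAG.sub(" ", row["transcript"])
        row["transcript"] = re.sub(r"\s+", " ", cleaned).strip()
        if row["transcript"]:  # skip clips that were nothing but noise
            writer.writerow(row)
```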