Question on training data set

I am training/fine-tuning DeepSpeech 0.7.0 (CUDA 10.0, cuDNN 7.6.5, TensorFlow 1.15.2, Ubuntu 16.04, Python 3.6.7) on a machine with two RTX 2080 Ti 11 GB GPUs.

I have a 2000+ hour conversational dataset (mostly American English). Because it is conversational, repeated words are sometimes missing from the transcripts, as are fillers such as “you know”, “like”, “uh”, “um”, etc. My question: should all of these be captured exactly in the training and validation sets in order to expect a low enough WER? In other words, does the training data need to be clean enough that the transcript matches the audio 100% for the system to perform well?


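By definition of machine learning, I fear the answer is yes: the model learns whatever audio-to-text mapping the data shows it, so the transcripts should match what is actually said.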
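
If you have verbatim transcripts for even a small sample, one rough way to see how far the existing transcripts drift from the audio is to score them against the verbatim text with word error rate. The sketch below is a plain word-level edit distance, not DeepSpeech's own evaluation code, and the two strings are made-up examples rather than real data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: what was actually said vs. what the transcript contains.
spoken = "um i think you know we should uh go go now"
transcript = "i think we should go now"

# Five of the eleven spoken words are missing from the transcript.
mismatch = word_error_rate(spoken, transcript)
print(f"{mismatch:.2%}")
```

A transcript that silently drops fillers and repetitions already carries this much error before the model has learned anything, which is why it matters for the WER you can expect.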

Thanks for the response, @lissyx

Also, if a recording contains clapping or laughter, would it be better to exclude it, or to include it and tag the transcript accordingly? Given that I don’t expect laughter or clapping in my real data, would it be best to exclude these recordings from the training and validation sets?

Thanks a lot.

As long as the speech is still intelligible, keeping the noisy clips is a good thing: it helps make your model robust against that noise.
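
If the existing transcripts already carry markers for those events, one option is to keep the clips and drop only the markers from the text. This is a minimal sketch under assumptions: the bracketed tags such as `[laughter]` and `[applause]` are a hypothetical convention, not something DeepSpeech defines, and it assumes your alphabet file does not contain bracket characters, so the tags could not stay in the transcript anyway.

```python
import re

# Hypothetical convention: non-speech events are written in square brackets.
NON_SPEECH_TAG = re.compile(r"\[[^\]]*\]")


def strip_non_speech_tags(transcript: str) -> str:
    """Remove bracketed event tags and collapse the leftover whitespace."""
    without_tags = NON_SPEECH_TAG.sub(" ", transcript)
    return " ".join(without_tags.split())


# Made-up example transcript.
cleaned = strip_non_speech_tags("that was great [laughter] thank you [applause]")
print(cleaned)
```

The audio itself is left untouched, so the laughter or clapping still reaches the model as background noise.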
