Clip Data for Training

I am trying to train a new model in Greek. My clip data consist of more than 25,000 wav files of approximately 55 seconds each.
The clips are taken from radio and TV interviews and shows, so there is a lot of music in the background and the speakers alternate (interviews or conversations).
In order to adjust the clips to the optimum 20-second length that DeepSpeech requires, I have started to cut them into various lengths from 5 to 25 seconds.
To check the result, I have trained a new model with approximately 1,500 wav files and with a KenLM scorer built from the transcripts of those 1,500 files.
When I try a new file to get its transcript, the model might recognize some words, but the result is really poor. Of course, if I try one of the files used in training, the accuracy exceeds 90%.
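The cutting step above can be sketched as follows. This is a minimal illustration using only Python's standard `wave` module; the function name `split_wav` and the fixed-offset cutting are my own assumptions — in practice you would want to cut at silences or speaker turns rather than at arbitrary sample boundaries.

```python
import wave

def split_wav(path, out_prefix, max_seconds=20):
    """Split a WAV file into consecutive chunks of at most max_seconds.

    Illustrative only: cuts at fixed offsets, whereas real segmentation
    for training data should cut at pauses between words/speakers.
    """
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * max_seconds
        idx = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            # Write this chunk with the same sample rate/width/channels.
            with wave.open(f"{out_prefix}_{idx:04d}.wav", "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            idx += 1
    return idx  # number of chunks written
```

A 55-second file at `max_seconds=20` would yield three chunks (20 s, 20 s, 15 s).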

I have also trained another model with clips recorded by individuals (doctors), without any background noise, music, or other people talking. The result is perfect: even when I pass in a new recording, the accuracy is more than 90%.

My question is: should the data I use for training be free of any noise, music, or other interference in order to have a good training set?

Also, for the scorer, do I have to use only the transcripts taken from the training data, or should it include more text?


Modern training procedures artificially add noise to improve generalization. So ideally you want to have clean data, but noisy data doesn't hurt either, particularly if the noises are domain-specific.

The scorer should include as much text data as possible. That also improves generalization.
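As a sketch, a broader scorer is usually built from a large external text corpus rather than only the training transcripts. The commands below assume the DeepSpeech repository layout (`data/lm/generate_lm.py` and the `generate_scorer_package` binary) and a local KenLM build; all file paths, and the alpha/beta values, are placeholders you would adjust for your own setup and language:

```shell
# Build an ARPA/binary LM from a large Greek text corpus (one sentence per line).
# greek_corpus.txt should contain much more text than just the 1,500 transcripts.
python3 data/lm/generate_lm.py \
  --input_txt greek_corpus.txt \
  --output_dir ./lm \
  --top_k 500000 \
  --kenlm_bins /opt/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# Package the LM into a scorer usable by DeepSpeech.
./generate_scorer_package \
  --alphabet alphabet.txt \
  --lm lm/lm.binary \
  --vocab lm/vocab-500000.txt \
  --package kenlm.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18
```

Check the documentation of your installed DeepSpeech version, since these flags have changed between releases.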

Actually, on that set (the one with the background noises) I have enabled augmentation to add extra noise.
Will this affect the result further?

Yes, of course. Probably with a positive effect.
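To make concrete what noise-overlay augmentation does to a clip, here is a minimal sketch in plain Python. The function `mix_at_snr` is a hypothetical helper (not part of DeepSpeech): it scales a noise signal so that the speech-to-noise power ratio matches a target SNR in dB, then adds it to the speech.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Overlay `noise` onto `speech` at a target signal-to-noise ratio (dB).

    speech, noise: equal-length lists of float samples.
    The noise is scaled by g so that
        power(speech) / power(g * noise) == 10 ** (snr_db / 10).
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Solve p_speech / (g**2 * p_noise) = 10**(snr_db/10) for g.
    g = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + g * n for s, n in zip(speech, noise)]
```

During training, the SNR and the noise clip are typically randomized per sample, so the model sees the same speech under many noise conditions.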