I am trying to train a new model on a Greek dataset. My data consists of more than 25,000 WAV files of approximately 55 sec each.
The clips are taken from radio and TV interviews and shows, so there is a lot of music in the background and the speakers alternate (interviews or conversations).
In order to bring the clips to the optimum 20 sec length that DeepSpeech requires, I have started cutting them into various lengths from 5 sec to 25 sec.
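For illustration, this kind of cutting can be sketched as follows. This is a minimal sketch, assuming uncompressed PCM WAV input and using only Python's standard `wave` module. The function name `split_wav`, the output naming scheme and the 20 sec default are hypothetical choices for the example, not part of DeepSpeech.

```python
import wave


def split_wav(src_path, out_prefix, chunk_seconds=20.0):
    """Split a PCM WAV file into consecutive chunks of at most chunk_seconds.

    Hypothetical helper for illustration; returns the list of files written.
    """
    out_paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source file
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Cuts at fixed positions like this can fall in the middle of a word, and every chunk still needs its own matching transcript, so cutting at pauses is usually the safer choice.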
To check the result, I trained a new model with approximately 1,500 WAV files and a KenLM scorer built from the transcripts of those same 1,500 files.
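As a rough illustration of preparing the text for such a scorer, the transcripts can be collected into a plain text file with one sentence per line, which is the kind of input a KenLM language model is built from. This is a minimal sketch, assuming the training data is described by a CSV file with a `transcript` column, as in the usual DeepSpeech CSV layout. The function name `transcripts_to_corpus` and the file paths are hypothetical.

```python
import csv


def transcripts_to_corpus(csv_path, corpus_path, column="transcript"):
    """Write every non-empty transcript from a CSV file to a text file,
    one sentence per line. Returns the number of lines written.

    Hypothetical helper; the column name is an assumption about the CSV.
    """
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(corpus_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Collapse repeated whitespace so each sentence is on one clean line.
            text = " ".join(row[column].split())
            if text:
                dst.write(text + "\n")
                count += 1
    return count
```

The resulting file contains only the sentences the acoustic model has already seen, so any additional text would have to be appended to it separately.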
When I try to transcribe a new file, the model might recognize some words, but the result is really poor. Of course, if I try a file that was used in the training, the accuracy exceeds 90%.
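One common way to put a number on this is the word error rate (WER): the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below is a minimal, self-contained version. Reading "accuracy" as 1 minus WER is an assumption about how the figures above were measured, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Measuring this on files that were not used for training gives a fairer picture than testing on the training files themselves.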
I have also trained another model on clips recorded by individuals (doctors), without any background noise, music or other people talking. The result is very good: even when I pass in a new recording, the accuracy is more than 90%.
My question is: should the data I use for training be free of noise, music and other interference in order to make a good training set?
Also, for the scorer, do I have to use the same transcripts as the training data, or should it include more data?