I'm building a dataset for the Russian language and, given the obvious lack of data, I'm looking for ways to extend it that are relatively simple and feasible for one person. Right now I have 300 hours of accurately (~97%) transcribed speech.
I have multiple ideas:
1st: Would it help, and to what extent, to add, say, 100 hours of speech taken from YouTube along with the auto-generated subtitles? (The transcription will naturally lack accuracy quite a bit.)
2nd: This one is connected to the 1st. Suppose I use a ground-truth transcription (say, YouTube subtitles, but hand-written, with accuracy around 95-99.9%) and cut the audio automatically based on the supplied time-marks, which are not always accurate. What I end up with is a roughly 80% good dataset: some clips are cut too late (catching the beginning of the next phrase), some too early (before the phrase has ended, e.g. part of the last word is cut off).
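For context, the cutting step in the 2nd idea could be sketched like this: slice a WAV at the subtitle time-marks, with a small symmetric padding so that slightly-off marks are less likely to clip word onsets or endings. This is a minimal stdlib-only sketch; the function name, the 0.2 s default padding, and the file paths are my own assumptions, not something fixed in my pipeline.

```python
# Hypothetical sketch: cut a WAV file at subtitle time-marks, adding a
# small safety margin (pad_s) on both sides so inaccurate marks are less
# likely to clip the start or the end of a phrase. Stdlib only.
import wave

def cut_segment(src_path, dst_path, start_s, end_s, pad_s=0.2):
    """Extract [start_s - pad_s, end_s + pad_s] from src_path into dst_path.

    Returns the number of frames written. Padding is clamped to the
    file boundaries, so it never reads before 0 or past the end.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        n_frames = src.getnframes()
        start_f = max(0, int(round((start_s - pad_s) * rate)))
        end_f = min(n_frames, int(round((end_s + pad_s) * rate)))
        src.setpos(start_f)
        frames = src.readframes(end_f - start_f)
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # nframes is patched automatically on close
        dst.writeframes(frames)
    return end_f - start_f
```

With, say, time-marks 0.2-0.6 s and a 0.1 s pad on 16 kHz audio, this writes the 0.1-0.7 s span, i.e. 9600 frames.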
Is it good practice to add that to the clean dataset? Will it lead to better performance, or should I avoid these methods altogether?
P.S. I've tried augmentation with noise, different speeds, and pitch shifts: I expanded 300 hours into ~950 hours, but it only seems to reduce model performance. I'm currently using n_hidden 2048, train_batch 32, lr 0.0001, dropout rate 0.15, lm_alpha 0.75. Are these parameters reasonable for 300 hours of speech? Should I reduce n_hidden to 1024 or so? The best dev loss I get is around 29-30 on the augmented 900 h dataset and ~20 on the clean one.
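To make the noise augmentation concrete, here is a minimal sketch of the kind of thing I mean: mixing white noise into a waveform at a target SNR. The function name, the 10 dB default, and the use of NumPy are illustrative assumptions, not my exact pipeline.

```python
# Hypothetical sketch of noise augmentation: mix Gaussian white noise
# into a float waveform at a target signal-to-noise ratio (in dB).
import numpy as np

def add_noise(signal, snr_db=10.0, rng=None):
    """Return a copy of `signal` with white noise mixed in at `snr_db`."""
    rng = rng if rng is not None else np.random.default_rng(0)
    sig_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(sig_power / noise_power)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Speed and pitch changes need resampling and are easier to do with an audio library, so they are not shown here.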
Any and all help is much appreciated!