Forced alignment and train data quality

Hi,

First of all, big thanks to the DeepSpeech team. Amazing work.

I am working on creating a dataset of Polish speech. I am using the aeneas forced-alignment tool to get alignments from audiobooks (public domain, of course). I can't afford to manually check and fine-tune all of the produced alignments. From random sampling of my dataset, I estimate there are about 1% really bad alignments (the audio doesn't match the transcript at all) and 12% almost-perfect alignments (for example, the last word of the transcript is slightly cut off); the rest are good quality (the audio matches the transcript 100%).

My question is: can I just ignore the bad-quality utterances, feed them to DeepSpeech along with the rest of the dataset, and hope that the small proportion of bad samples won't disturb overall training convergence?
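One cheap sanity check I'm considering, instead of full manual review, is flagging utterances whose speaking rate is implausible, as a rough proxy for bad alignments. This is just a sketch; the function name and the characters-per-second thresholds are my own assumptions, not anything from aeneas or DeepSpeech:

```python
# Hypothetical alignment QC heuristic: an utterance whose transcript length
# doesn't roughly match its audio duration is likely a bad alignment.
# The cps thresholds below are guesses and would need tuning per language.

def plausible_rate(transcript: str, duration_s: float,
                   min_cps: float = 4.0, max_cps: float = 25.0) -> bool:
    """Return True if characters-per-second falls in a sane range."""
    if duration_s <= 0:
        return False
    cps = len(transcript) / duration_s
    return min_cps <= cps <= max_cps

# (transcript, audio duration in seconds)
samples = [
    ("dzien dobry wszystkim", 1.5),  # ~14 cps: plausible
    ("tak", 6.0),                    # ~0.5 cps: probably a bad alignment
]
kept = [s for s in samples if plausible_rate(*s)]
```

It obviously won't catch every mismatch (a wrong transcript of similar length passes), but it's nearly free to run over the whole corpus.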

I've experienced something like that. I'll explain what I did: in a Spanish training dataset of around 350 hours, 30 of them weren't perfect (in terms of alignment), so that's less than 9% bad-quality data. I then trained on that dataset (n_hidden = 2048, dropout 0.2) and got around 10% CER, and a few samples that were incorrectly labeled in the test set were correctly transcribed by my model!

So I think you should give it a shot, and while it's training, keep cleaning up your dataset. Just remember that there's no perfect dataset; it's impossible to check thousands of speech samples by hand.
