Hi,
first of all - big thanks to the DeepSpeech team. Amazing work.
I am working on creating a dataset of Polish speech. I am using the aeneas forced-alignment tool to get alignments from audiobooks (public domain, of course). I can't afford to manually check and fine-tune all of the produced alignments. From randomly sampling my dataset, I estimate about 1% are really bad alignments (the audio doesn't match the transcript at all) and about 12% are almost perfect (for example, the last word of the transcript is slightly cut off); the rest are good quality (the audio matches the transcript 100%).
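
For context, this is roughly how I run aeneas on one audiobook chapter; the paths are placeholders and the config follows the pattern from the aeneas README:

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# task config: Polish, plain-text transcript, JSON sync map output
config = u"task_language=pol|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/data/audiobooks/chapter01.mp3"
task.text_file_path_absolute = "/data/audiobooks/chapter01.txt"
task.sync_map_file_path_absolute = "/data/audiobooks/chapter01.json"

# compute the forced alignment and write the sync map to disk
ExecuteTask(task).execute()
task.output_sync_map_file()
```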
My question is - can I just leave the bad-quality utterances in, feed everything to DeepSpeech, and hope that such a small proportion of bad samples won't disturb overall training convergence?
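
In case it matters: the only cheap automatic filtering I can think of is a speaking-rate sanity check over each aligned fragment, something like the sketch below (the thresholds are guesses I haven't tuned, not recommended values):

```python
import json

# Flag fragments in an aeneas JSON sync map whose speaking rate
# (characters per second) looks implausible for read speech.
MIN_CPS, MAX_CPS = 4.0, 25.0  # guessed bounds, not tuned

def suspicious_fragments(syncmap_path):
    with open(syncmap_path, encoding="utf-8") as f:
        syncmap = json.load(f)
    flagged = []
    for frag in syncmap["fragments"]:
        duration = float(frag["end"]) - float(frag["begin"])
        text = " ".join(frag["lines"]).strip()
        if duration <= 0 or not text:
            flagged.append(frag["id"])
            continue
        cps = len(text) / duration
        if not (MIN_CPS <= cps <= MAX_CPS):
            flagged.append(frag["id"])
    return flagged
```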