Training a model for a low resource language (trying to incorporate DSAlign)

I’m trying to get to the point where I can point DSAlign at a new dataset and automatically get new training examples. So far, I’m finding DSAlign performs poorly unless DeepSpeech already does a very good job of producing transcripts. It’s almost as if DSAlign is only useful for people who already have a well-working model to begin with. The chicken-and-egg problem is especially frustrating because producing training data by hand is soul-destroying. I was hoping DSAlign could make the process more or less semi-supervised, but it seems that unless I hand-transcribe about 7 hours per speaker, DeepSpeech doesn’t work well enough for DSAlign to be effective. Am I missing something, or is this a fair assessment?

I’d be interested in trying forced alignment solutions that don’t depend on ASR, to bootstrap the process for a low resource language.

Part of the reason I believe DSAlign is doing such a poor job is that the VAD library seems to be generating fragments that are too short. When fragments are longer, DeepSpeech has a better chance of using the language model to come up with a reasonable transcription; when a fragment is short, noise dominates the transcription and DSAlign seems helpless at that point. I’m already using the aggressiveness = 0 option, so short of switching to another library for segmentation (I think the Audacity algorithm does a brilliant job!), I’ve been thinking of combining shorter fragments into surrounding fragments before feeding them to the ASR.
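To make the merging idea concrete, here’s a rough sketch in plain Python. The `Fragment` tuple is made up for illustration; the real objects DSAlign gets back from the VAD will look different, but the folding logic would be the same:

```python
from collections import namedtuple

# Hypothetical fragment representation: start/end times in seconds plus raw audio bytes.
Fragment = namedtuple("Fragment", ["start", "end", "audio"])

def merge_short_fragments(fragments, min_duration=1.0):
    """Fold any fragment shorter than min_duration into the preceding one,
    so the ASR sees longer chunks and the language model has more context."""
    merged = []
    for frag in fragments:
        duration = frag.end - frag.start
        if merged and duration < min_duration:
            prev = merged[-1]
            merged[-1] = Fragment(prev.start, frag.end, prev.audio + frag.audio)
        else:
            merged.append(frag)
    return merged
```

This only merges backwards into the previous fragment; a smarter version might pick whichever neighbour is closer in time, but even this naive pass should stop half-second noise bursts from being transcribed in isolation.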

Another idea I’ve just thought of is having DeepSpeech keep a memory of recent inferences, which could provide context to make the next inference better. Would it be difficult to make such a change? Alternatively, we could give inference a text argument containing the probable transcription of the audio. I would love to try these ideas out, but I don’t know where to start.
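The “memory of recent inferences” idea could be prototyped outside DeepSpeech itself, as a wrapper. This is only a sketch: `stt` and `rescore_with_context` are placeholder callables standing in for an n-best decoder and a context-aware rescorer, not real DeepSpeech APIs:

```python
from collections import deque

class ContextualTranscriber:
    """Keeps the last few transcripts and hands them to a (hypothetical)
    rescoring step, so each inference can lean on recent text as context."""

    def __init__(self, stt, rescore_with_context, history_size=3):
        self.stt = stt                        # audio -> list of candidate transcripts
        self.rescore = rescore_with_context   # (candidates, context_text) -> best transcript
        self.history = deque(maxlen=history_size)

    def transcribe(self, audio):
        candidates = self.stt(audio)
        context = " ".join(self.history)
        best = self.rescore(candidates, context)
        self.history.append(best)  # remember the result for the next call
        return best
```

The bounded `deque` means old context ages out automatically, which should limit the damage when an early transcription is wrong.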

I think the best solution to alignment would utilise context in the transcription phase by generating transcripts for the fragments before (B) and after (A) the current fragment (C). Then, using those results plus a rough estimate of where the surrounding text S is, pick the most probable transcription, i.e. the t that maximises P(T = t | A, B, C, S).
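That objective could be approximated as a rescoring pass over n-best candidates for the current fragment. A minimal sketch, under heavy assumptions: `score` is a placeholder for a real scoring function (e.g. a KenLM query on the concatenated text), and the neighbouring fragments are assumed to already have transcripts:

```python
def best_transcription(candidates, before, after, surrounding, score):
    """Pick the candidate t maximising a cheap proxy for P(T=t | A, B, C, S):
    each candidate is scored in the context of the neighbouring transcripts
    (before/after) and the estimated surrounding reference text."""
    def contextual_score(t):
        # Placeholder scoring: a real implementation might query a language
        # model on "before + t + after" and add a match bonus against the
        # estimated surrounding text S.
        return score(before + " " + t + " " + after, surrounding)
    return max(candidates, key=contextual_score)
```

Even a crude `score` (say, word overlap with the surrounding reference text) might be enough to break ties between acoustically similar candidates, which is exactly where short noisy fragments currently go wrong.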