You have to modify the training pipeline for that. Instead of applying the gradients immediately, you store them, compute the gradients for however many additional batches you need, average them, and only then update the weights. For example, if your maximum possible batch size is 32, compute the gradients for the first batch, then instead of updating the weights right away, compute the second batch's gradients too, average both, and update the weights once, effectively getting a batch size of 64.
Edit: I didn't see that you had already understood what that meant.
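A minimal sketch of that gradient-accumulation pattern in plain TensorFlow; the toy model, dummy dataset, and the `accum_steps` name are mine for illustration, not from the DeepSpeech code:

```python
import tensorflow as tf

# Toy stand-in for the real pipeline; DeepSpeech's training loop is
# different, this only shows the accumulate-then-update pattern.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(20,))])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

# Dummy batches of 32 (pretend 32 is the biggest batch that fits in memory).
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([128, 20]), tf.random.normal([128, 1]))
).batch(32)

accum_steps = 2  # average over 2 batches -> effective batch size of 64
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Store the gradients instead of applying them right away.
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:
        # Average the stored gradients and update the weights once.
        optimizer.apply_gradients(
            [(g / accum_steps, v)
             for g, v in zip(accum_grads, model.trainable_variables)]
        )
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
```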
If I understand it correctly (and there's a good chance I'm missing something or explaining it badly), you have to drop the last layer (in order to make it sensitive to your language while still reusing the work done on the old language). You then retrain that last layer with your language's dataset and adapt the LM (a rough sketch follows below).
You can also merge two acoustic models (I know the theory, but in practice I don't know how to do it), but that's more for a non-native-accent-oriented model.
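To make the "drop the last layer and retrain it" idea concrete, here's a rough Keras-style sketch. This is not DeepSpeech's actual checkpoint mechanics (if I remember correctly, the transfer-learning branch exposes this as `--drop_source_layers`), and the layer sizes and alphabet sizes below are made up for illustration:

```python
import tensorflow as tf

# Hypothetical stand-in for the pretrained (English) acoustic model.
pretrained = tf.keras.Sequential([
    tf.keras.layers.Dense(2048, activation="relu", input_shape=(494,)),
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dense(29),  # sized for the English alphabet (+ CTC blank)
])
# pretrained.load_weights(...)  # in reality: load the English checkpoint

# Drop the last layer, keep everything below it.
base = tf.keras.Model(pretrained.input, pretrained.layers[-2].output)
base.trainable = False  # optionally freeze the reused layers at first

# Fresh output layer, sized for the new language's alphabet (e.g. French).
new_output = tf.keras.layers.Dense(37)(base.output)
model = tf.keras.Model(base.input, new_output)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
# model.fit(french_dataset, ...)  # retrain on the new language's data
```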
Well, that was not what I expected, haha. I'm reluctant to modify the code; I don't want to introduce a bug.
Even if I do this, I think there's more to investigate, because I'm at 52% WER and I've seen people get down to 22% with CV.
I still don't get why my loss never goes below 30 while yours goes below 5…
The only transfer learning I'm doing for French is not really the same transfer learning as discussed above.
BTW, you mention not being able to get below a loss of 30. Is this with CV FR? I'm pretty sure I already shared links with you to the GitHub issues: the data inside Common Voice needs some love, and I haven't been able to find time for that. And so far, nobody has cared.
Yeah, I remember the various issues with the French CV dataset; that's why I'm using English CV for now. See it as a POC, to get used to TL, DeepSpeech, and ASR.
I contributed a few sentences to CV but, same as you, I don't have a lot of time for it…
The Common Voice dataset contains clips with errors. I’m working on building up a list of the offending clips so they can be put in the invalid clip CSV, but in the meantime, if you see a transcript that is wildly different from what it’s supposed to say, you can look up the transcript in the test CSV to get the original filename, then play it back and see if it’s correct. If not, just remove it from the CSV.
I was able to improve the WER a few points just by removing a few of the worst offenders from the test set.
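If it helps, here is a small pandas sketch of that clean-up; the `bad_clips` list and file names are hypothetical, and I'm assuming the usual DeepSpeech importer columns (wav_filename, wav_filesize, transcript):

```python
import pandas as pd

# Filenames you've listened to and confirmed are wrong (hypothetical list).
bad_clips = {
    "clips/sample-000123.wav",
    "clips/sample-004567.wav",
}

# DeepSpeech importer CSVs have wav_filename, wav_filesize, transcript columns.
test = pd.read_csv("test.csv")

# Look up a suspicious transcript to find the original filename...
print(test[test["transcript"].str.contains("some suspicious text", na=False)])

# ...play the file back, and if the transcript really is wrong,
# drop it and write a cleaned CSV.
cleaned = test[~test["wav_filename"].isin(bad_clips)]
cleaned.to_csv("test_cleaned.csv", index=False)
```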
The pretrained model doesn't give accurate results; maybe that's because of the differences in accent and pronunciation between the training data (American English) and my voice (female Asian-Indian accent). So I was wondering if there is any pretrained model built on Common Voice data.
If you're trying to improve it specifically for Indian English, some fine-tuning might help (I know others on here have been looking at that for Indian accents, but I don't know how they've got on). Another approach to consider would be including Indian-sourced text in the LM, since that could help it cope with “Indianisms” that aren't typically part of the American/British English data that likely makes up the bulk of the LM data.
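Roughly what that could look like, as a sketch; all file names here are placeholders and I'm assuming KenLM is installed (DeepSpeech's data/lm directory documents the real rebuild steps):

```python
import subprocess

# Combine the original LM corpus with Indian-sourced text.
# All paths are placeholders, not the real DeepSpeech file names.
with open("lm_corpus_combined.txt", "w", encoding="utf-8") as out:
    for path in ["lm_corpus_original.txt", "indian_english_text.txt"]:
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line.lower())

# Rebuild the n-gram LM with KenLM (assumes lmplz/build_binary are on PATH).
subprocess.run("lmplz -o 5 < lm_corpus_combined.txt > lm.arpa",
               shell=True, check=True)
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```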
Sadly, that's very likely your issue. Even with the Common Voice data, the amount and diversity available right now are not yet enough to provide a noticeable improvement for non-American English.
I haven't tested it on a wide range of inputs. I did some minor testing with phrases like “hello”, “this is testing file”, etc., and the results weren't accurate even for those short utterances. I also had one more problem/query: the model generated text for the noise that came along with the audio.
For example, a clip with a one-second pause followed by “hello” gave “and hello”.