I have selected some keyword intents, say fixed deposit and recurring deposit.
I have downloaded a 2-hour video for each intent, so I have 4 hours of voice in total. Then I applied pitch, speed, and noise augmentation, which expanded the voice data to 16 hours.
Is that not good or sufficient for DeepSpeech training? I am not building ASR for a general domain, and not even for finance in general; I am going intent by intent. Is that not the right approach?
Or, if you mean 20 hours is not sufficient, does that mean I need to go for more generic Indian English data rather than being intent-specific (like fixed deposit or recurring deposit)?
If these 2 intents work out, I will keep adding voice data for more financial intents.
Is that a good, viable approach?
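For reference, this sort of pitch / speed / noise augmentation can be scripted with sox; a minimal sketch, where the file names and amounts are placeholders, not recommendations:

```bash
# Minimal sox-based sketch of the pitch / speed / noise augmentation
# described above. File names and amounts are illustrative only.
in=fixed_deposit_clip.wav

# Pitch shift by +/- 2 semitones (200 cents); duration unchanged
sox "$in" clip_pitch_up.wav   pitch  200
sox "$in" clip_pitch_down.wav pitch -200

# Speed the audio up / down by 10% without changing pitch
sox "$in" clip_fast.wav tempo 1.1
sox "$in" clip_slow.wav tempo 0.9

# Generate low-level white noise of the same length as the input,
# then mix it with the original
sox "$in" noise.wav synth whitenoise vol 0.02
sox -m "$in" noise.wav clip_noisy.wav
```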
lissyx
#17
You should just do it and see. As repeated, this is highly dependent on your data and your goals. Right now, you have spent much more time asking and asking than it would have taken you to run a few iterations and get actual, valuable feedback from your dataset and goals.
othiele
And just 4 hours of truly genuine material is not much; get 10 and blow it up. That might get you somewhere. And lissyx is of course right that it depends greatly on the task. If we had exact numbers we would tell you, but this is not an exact science …
Hello @lissyx and @othiele,
I have downloaded all the dependencies and successfully fine-tuned with 34 hours of Indian-accent audio extracted from YouTube. There are 16,500 audio files in total, which I split into 13k / 2k / 1.5k (train / dev / test). But I didn't get good accuracy: my WER is 0.29.
The model gives a WER of 0.29 on the test set, and it even predicts some words wrong that the pre-trained model (without fine-tuning) got right.
Am I doing something wrong?
If you need any other info, please let me know.
EDIT: I am using DeepSpeech 0.6.1
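For anyone following along, the sort of invocation involved looks roughly like this; a sketch for DeepSpeech 0.6.1, assuming the 0.6.1 release checkpoint has been downloaded, with placeholder paths and hyperparameter values:

```bash
# Sketch: fine-tuning DeepSpeech 0.6.1 from the release checkpoint.
# Paths and hyperparameter values are placeholders, not recommendations.
# --n_hidden must match the release model (2048).
python3 DeepSpeech.py \
  --checkpoint_dir ~/deepspeech-0.6.1-checkpoint \
  --alphabet_config_path data/alphabet.txt \
  --train_files indian_accent/train.csv \
  --dev_files indian_accent/dev.csv \
  --test_files indian_accent/test.csv \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.15 \
  --epochs 3
```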
lissyx
#20
Please use current master for transfer learning.
That’s too vague to be actionable. When doing transfer learning you expect your model to “un-learn” and then re-learn, so this regression could well be a side effect of that.
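On master, the transfer-learning invocation looks roughly like the sketch below; the flag names are from the master branch at the time and may change, and the values and paths are illustrative:

```bash
# Sketch: transfer learning on current master. --drop_source_layers
# re-initialises the last N layers of the source model before training;
# master also splits the checkpoint dir into load/save directories.
python3 DeepSpeech.py \
  --load_checkpoint_dir ~/deepspeech-checkpoint \
  --save_checkpoint_dir ~/indian-accent-checkpoint \
  --drop_source_layers 1 \
  --alphabet_config_path data/alphabet.txt \
  --train_files indian_accent/train.csv \
  --dev_files indian_accent/dev.csv \
  --test_files indian_accent/test.csv
```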
Before fine-tuning, the results were quite bad, since the audio was Indian-accented. After fine-tuning, the results improved a little compared to the pre-trained model.
But for some sentences I tested both before and after fine-tuning (not from the training data; I am talking about random data), the pre-trained model did well, while after fine-tuning the results are not good.
For example:
Sentence 1: Any random sentence
Sentence 2: Test Data from Indian Accent Corpus
Using the pre-trained model:
Sentence 1 has good results, but not Sentence 2.
After fine-tuning on the same corpus:
Sentence 2 has good results, but not Sentence 1.
In my opinion, after fine-tuning, both Sentence 1 and Sentence 2 should be good, or at least the same as before.
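One way to turn these impressions into proper figures is to score both checkpoints on both kinds of test data; a sketch using the repo's evaluate.py, where the paths are placeholders and original_domain/test.csv stands for whatever held-out data matches the pre-trained model's distribution:

```bash
# Sketch: score the original and fine-tuned checkpoints on both an
# original-domain test set and the new Indian-accent test set,
# so the regression (and the improvement) show up as WER/CER numbers.
for ckpt in ~/deepspeech-0.6.1-checkpoint ~/finetuned-checkpoint; do
  for testset in original_domain/test.csv indian_accent/test.csv; do
    python3 evaluate.py \
      --checkpoint_dir "$ckpt" \
      --alphabet_config_path data/alphabet.txt \
      --test_files "$testset"
  done
done
```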
But that would require the source model to have been trained on master, right?
lissyx
#23
I think the checkpoints are still compatible
lissyx
#24
“Too bad” and “a little” are not really helpful. Please provide proper figures; you might be tricked by individual examples.
In my opinion, you need to share more context on what you have tested. Fine-tuning requires work: maybe your learning rate is too high, maybe you need to tune dropout, maybe you need to train for more or fewer epochs. Have you had a look at the train / dev loss evolution?
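Concretely, that kind of sweep might look like the sketch below. The values are examples only; copying the release checkpoint into each run's directory is needed because in 0.6.1 --checkpoint_dir is used for both loading and saving:

```bash
# Sketch: sweep learning rate and dropout, watching the per-epoch
# train/dev loss each run prints. Values are examples, not recommendations.
for lr in 0.0001 0.00001; do
  for dropout in 0.1 0.2 0.3; do
    ckpt="ckpt_lr${lr}_do${dropout}"
    # Each run fine-tunes a fresh copy of the release checkpoint
    cp -r ~/deepspeech-0.6.1-checkpoint "$ckpt"
    python3 DeepSpeech.py \
      --checkpoint_dir "$ckpt" \
      --alphabet_config_path data/alphabet.txt \
      --train_files indian_accent/train.csv \
      --dev_files indian_accent/dev.csv \
      --learning_rate "$lr" \
      --dropout_rate "$dropout" \
      --epochs 10
  done
done
```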
Hi @othiele,
I have shared the link to the YouTube playlist. I used the transcripts provided by the channel and did some pre-processing.
othiele
And I don't find 0.29 bad for 30 hours from YouTube.
Yeah, I also did some hyperparameter tuning and got a WER of 0.20 (please refer to my last comment).
But the problem is that it disturbed the previous weights.
lissyx
#28
So, on your YouTube-based test set, you have 52.5% WER and 31.1% CER before fine-tuning, and 20.4% WER / 10.5% CER after fine-tuning with ~30h of data?
That indeed does look like a very nice improvement.
Well, that’s going to be the issue you have to work on.
I suspect you want more than just that one validation set and test set if you want to avoid degrading quality on previous data. Otherwise, it makes sense that the new training optimizes for the new data.
Any suggestions from your side? That would be great.
I suspect you want more than just that one validation set and test set if you want to avoid degrading quality on previous data.
Sorry, I didn’t get this point. Can you please elaborate?
lissyx
#34
I don’t have your dataset, so I can’t do that work for you. I’ve already shared suggestions.
I don’t see how to say it otherwise: you are fine-tuning and using only one validation set, so your network is getting optimized for this one. That’s also why it regressed on the previous dataset.
othiele
I guess, to state it in plain English, you could go for an even lower learning rate, or put more plain-English examples into the validation set, so as to alter the original weights a little less.
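Putting those two suggestions together, the adjustment might look something like this sketch; the --dev_files / --test_files flags accept comma-separated lists of CSVs, and the paths and values here are placeholders:

```bash
# Sketch: validate and test on BOTH the new Indian-accent data and
# original-domain data, with a lower learning rate, so the old
# distribution still constrains training and regressions show up early.
python3 DeepSpeech.py \
  --checkpoint_dir ~/finetune-checkpoint \
  --alphabet_config_path data/alphabet.txt \
  --train_files indian_accent/train.csv \
  --dev_files indian_accent/dev.csv,original_domain/dev.csv \
  --test_files indian_accent/test.csv,original_domain/test.csv \
  --learning_rate 0.00001 \
  --epochs 3
```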