As the title says, I'm looking to do some transfer learning from the English model to a model responsive to non-native English (e.g. a native French speaker speaking English).
I saw that there is a branch called transfer-learning2, which most posts about TL here use. But what about all the other branches, such as transfer-learning-en? Are they deprecated, or should I use one of those?
Second question: I thought of using the CV dataset first, to get a good WER and become more familiar with TL, and then using the AMI corpus to make my model responsive to non-native speech. Is that a good idea? If so, has anyone worked on an importer for AMI?
Thanks a lot
PS: If anything is unclear, don't hesitate to ask!
Edit: I understand that the pre-trained model was trained on LibriSpeech, but I've read some posts mentioning training with CV. Which one is true?
lissyx
Thanks for sharing! Yes, I read it.
It seems to me that creating an ASR system for native and non-native French might be problematic given the small amount of data. Ideally, I'd love to work on French speakers with different mother tongues, such as Arabic, Spanish, Polish, … But right now, that's difficult with my timeline…
So I'll more likely develop an ASR system for native and non-native English.
That's why I said “french native speaking english”, meaning an English speaker whose mother tongue is French. Hence the idea of using the AMI corpus.
I'm sure I'm not the first one to think of this, but, as in this topic, I can't find an answer that satisfies me.
lissyx
Well, there's no better solution than getting a broader range of accents contributed to Common Voice French.
Okay, then I misunderstood: I thought you meant “transfer learning from English to non-native” as in “re-use the English model to train a French model” (which is covered by the Docker I shared).
I'm not sure what you can achieve using French Common Voice, because you will get people speaking French.
Then I guess your problem is indeed simpler: you just need an importer for the AMI dataset, and then you can transfer-learn by re-training from checkpoints?
Well, yes, that's the idea I'm prioritizing.
And the CV dataset also has some non-native speakers (tagged in the “accent” column of the .tsv file), so I'll see what the results are after re-training with it from the v0.5.1 checkpoints.
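For reference, a minimal sketch of filtering those accented rows out of the CV .tsv (the accent column index and the label value differ between CV releases, so both are placeholders here):

```bash
# Keep the TSV header, then keep only rows whose accent column matches the
# label of interest. Column 8 and the value "france" are placeholders:
# check the header line and the actual labels in your CV release.
head -n 1 train.tsv > train-accented.tsv
awk -F'\t' '$8 == "france"' train.tsv >> train-accented.tsv
```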
I've never done TL before, so I'm looking into topics and docs about it to find the best parameters (freeze all layers except the last one? modify dropout_rate? …).
lissyx
There are, but I fear not enough for your use case yet.
@josh_meyer is the one who worked on that topic, but he was focusing more on the use case I described: learning a new language by re-using the English model, not fine-tuning for just an accent. So I don't know how well his branch applies to your use.
That being said, I remember his branch only retrains a few layers.
Yes, I read that there is a flag where you can choose which layers you retrain. I tried it, comparing the original model vs. the re-trained model on the CV dataset. No great improvement after a few epochs (training stopped early because the loss wasn't improving enough)… I don't know yet whether I messed up my parameters or just didn't train enough.
I'll try re-training again from the checkpoints of my last retraining!
Don't know if it's useful to you, but there is a dataset of African-accented French speakers on openslr.org
lissyx
It's always useful to know about that; I'm obviously not spending enough time on the OpenSLR website.
I have the idea of starting from an importer like the librivox importer. Don't know where it'll lead or how much time it'll take, though…
Is the fact that each recording of the AMI corpus is ~1h30 long a problem? I saw that it was an issue in v0.3.
lissyx
Yes, it's going to be way too big. We limit ourselves in current importers to 10-15 secs max, to balance batch size against GPU memory usage.
I have one last question; it may be dumb, but I have to ask. Maybe you or @josh_meyer can help me.
Suppose I use TL, retraining only the last layer from the v0.5.1 checkpoint, then stop, test my model, and restart retraining only the last layer. What checkpoints should I use? Do I have to restart from the v0.5.1 checkpoint?
Or can I start from my retrained checkpoint? If so, should I remove the --drop_source_layers 2 flag?
Hmmm… What's my solution then? Find another dataset, or try to cut this one into 10-15 sec parts, at the risk of cutting sentences in two?
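As a crude sketch of the chunking option (assuming ffmpeg is available; the input name follows AMI's naming but is just an example, and fixed windows will inevitably cut some words, so cutting on AMI's word-level timing annotations would be the cleaner route):

```bash
# Split one long AMI mix into fixed 12 s WAV chunks, numbered sequentially.
# Fixed-length windows ignore utterance boundaries; aligning the cuts to
# AMI's word-level annotations instead would avoid truncated sentences.
ffmpeg -i ES2002a.Mix-Headset.wav -f segment -segment_time 12 out_%04d.wav
```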
Is it possible with 35 s? I have another dataset with the same sentence pronounced by a lot of people with different accents, but the sentence is quite long.
To use this branch, you will need to add and read the following params:
--fine_tune: whether to fine-tune the transferred layers from the source model or not
--drop_source_layers: a single integer for how many layers to drop from the source model (to drop just the output layer == 1, to drop the penultimate and output layers == 2, etc.)
--source_model_checkpoint_dir: the path to the trained source model; it will load all the layers and drop the specified ones
Things you can't do with the current branch: fine-tune specific layers, drop specific layers, or freeze specific layers.
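For concreteness, a sketch of how those flags might be combined (the three flags above come from the branch; the rest are stock DeepSpeech flags; all paths and hyperparameters are illustrative, and the boolean may need an explicit value like --fine_tune True depending on the flag parser):

```bash
# Load the released v0.5.1 checkpoint, drop only its output layer (== 1),
# and fine-tune the remaining transferred layers on the new data.
python3 DeepSpeech.py \
    --source_model_checkpoint_dir deepspeech-0.5.1-checkpoint/ \
    --drop_source_layers 1 \
    --fine_tune \
    --train_files cv/train.csv \
    --dev_files cv/dev.csv \
    --test_files cv/test.csv \
    --alphabet_config_path data/alphabet.txt \
    --learning_rate 0.00012 \
    --dropout_rate 0.24 \
    --checkpoint_dir tl-checkpoints/
```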
For the following results, I only dropped the last layer and fine-tuned the others.
| Total hours | LR | Dropout | Epochs | Mode | Batch Size | Test set | WER |
|---|---|---|---|---|---|---|---|
| 500 | 0.00012 | 0.24 | 1 | Transfer Learning | 12 | Train-es-common voice | 27% |
| 500 | 0.000001 | 0.11 | 2 | Transfer Learning | 10 | Train-es-common voice | 46% |
| 500 | 0.0001 | 0.22 | 6 | From scratch | 24 | Train-es-common voice | 50% |
For 500 h of data, 1 epoch seems to be enough when dropping the last layer and fine-tuning the other ones.
As @lissyx mentioned, I think the way to go for you is to just fine-tune the existing model with your data, using a very low LR like 0.000001.
The transfer-learning approach, I feel, is meant to solve the issue of different alphabets.
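A sketch of that plain fine-tuning route on stock DeepSpeech (no layers dropped, so the transfer-learning branch isn't needed; paths are illustrative, and checkpoints are both read from and written to --checkpoint_dir):

```bash
# Continue training the released English model on accented data with a very
# low learning rate, starting from (and saving into) the v0.5.1 checkpoints.
python3 DeepSpeech.py \
    --checkpoint_dir deepspeech-0.5.1-checkpoint/ \
    --train_files accented/train.csv \
    --dev_files accented/dev.csv \
    --test_files accented/test.csv \
    --alphabet_config_path data/alphabet.txt \
    --learning_rate 0.000001 \
    --dropout_rate 0.11
```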
Yes, 500 h of training, then testing on the Common Voice Spanish set. Almost all of my training set is from the same domain, so I'd better use the Common Voice set to avoid biased results.