Non-native English with transfer learning from the v0.5.1 model: right branch, method and discussion

Hello,

As the title says, I’m looking to do some transfer learning from the English model to a model responsive to non-native English (e.g. a French native speaking English).
I saw that there is a branch called transfer-learning2, which is the one used in most posts about TL here. But what about all the other branches, such as transfer-learning-en? Are they deprecated, or should I use those instead?

Second question: I thought of using the CV dataset first, to get a good WER and become more familiar with TL, and then using the AMI corpus to make my model responsive to non-native speech. Is that a good idea? If so, has anyone worked on an importer for AMI?

Thanks a lot

PS: If anything is unclear, don’t hesitate to ask!

Edit: I understand that the pre-trained model was trained on LibriSpeech, but I’ve read some posts talking about training with CV. Which one is true?

It would be great if you joined forces: https://discourse.mozilla.org/c/voice/fr https://github.com/Common-Voice/commonvoice-fr/issues

As far as I can tell, they are deprecated.

No. What is the AMI dataset?

Common Voice, as you can see from the French training data shared earlier, is not yet enough.

Thanks for your quick answer @lissyx!

The AMI corpus is a dataset used for non-native recognition: http://www.openslr.org/16/

Which dataset should I use, then?

Have you read the links I shared?

@caucheteux https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train.fr

Thanks for sharing! Yes, I read it :slight_smile:
It seems to me that creating an ASR system for native and non-native French might be problematic with such a small amount of data. Ideally, I’d love to work on French speakers with different mother tongues, such as Arabic, Spanish, Polish, … But right now, that’s difficult with my timeline…

I’ll more likely develop an ASR system for native and non-native English.
That’s why I said “a French native speaking English”, meaning an English speaker whose mother tongue is French. Hence the idea of using the AMI corpus.

I’m sure I’m not the first one to think of this, but, as in this topic, I can’t find an answer that satisfies me.

Well, there’s no better solution than getting speakers with a broader range of accents to contribute to Common Voice French.

Okay, then I misunderstood: I thought you meant “transfer learning from English to non-native” as in “re-use the English model to train a French model” (which is covered by the Dockerfile I shared).

I’m not sure what you can achieve using French Common Voice, because you will get people speaking French.

Then I guess your problem is indeed simpler: you just need an importer for the AMI dataset, and then you can transfer-learn by re-training from checkpoints?

That’s what I thought when I read your Dockerfile.

Well, that’s the idea I’m prioritizing, yes.
And the CV dataset also has some non-native speakers (tagged in the “accent” column of the .tsv files), so I’ll see what the results are after re-training with it from the v0.5.1 checkpoints.
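Those non-native clips can be pulled out by filtering the .tsv metadata on the accent column. A minimal sketch — the tab-separated layout with an `accent` column matches Common Voice releases of that era, but the accent labels and file names below are invented for illustration:

```python
import csv
import io

def filter_by_accent(tsv_text, wanted_accents):
    """Yield rows from a Common Voice-style .tsv whose 'accent' field matches."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row.get("accent") in wanted_accents:
            yield row

# Tiny inline sample mimicking the CV .tsv layout (all values made up)
sample = (
    "client_id\tpath\tsentence\taccent\n"
    "a1\tclip1.mp3\thello world\tfrance\n"
    "b2\tclip2.mp3\tgood morning\tus\n"
    "c3\tclip3.mp3\tthank you\tfrance\n"
)

french_accented = list(filter_by_accent(sample, {"france"}))
print(len(french_accented))  # 2 clips tagged 'france'
```

In a real run you would read `validated.tsv` from disk and write the matching rows out as a training CSV.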

I’ve never done TL before, so I’m looking through topics and docs to find the best parameters for it (freeze all layers except the last one? modify the dropout rate? …).

There are, but I fear not enough for your use case yet.

@josh_meyer is the one who worked on that topic, but he was focusing more on the use case I described: learning a new language by re-using the English model, not fine-tuning for just an accent. So I don’t know how well his branch applies to your use.

That being said, I remember his branch only retrains a few layers.

That’s what I thought too…

Yes, I read that there is a flag that lets you choose which layers to retrain. I tried it, comparing the original model vs. the re-trained model on the CV dataset. Not a great improvement after a few epochs (training stopped early because the loss wasn’t improving enough)… I don’t know yet whether I messed up my parameters or just didn’t train enough.
I’ll try re-training again from the checkpoints of my last run!

I don’t know if it’s useful to you, but there is a dataset of African-accented French speakers: http://www.openslr.org/57/

It’s always useful to know about that; I’m obviously not spending enough time on the OpenSLR website :slight_smile:

I have the idea of starting from an importer like the LibriVox one. I don’t know where it will lead or how much time it will take, though…
Is the fact that each AMI recording is ~1h30 long a problem? I saw that it was an issue in v0.3.

Yes, it’s going to be way too big. We limit ourselves in the current importers to 10-15 seconds max, to balance batch size against GPU memory usage.
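To stay under a cap like that without chopping sentences in half, one option is to greedily merge aligned utterances into clips. A sketch, assuming you already have (start, end, text) tuples from a forced-alignment step (the helper name and the sample timestamps are made up):

```python
def chunk_utterances(utterances, max_len=15.0):
    """Greedily group aligned utterances (start, end, text) into clips
    no longer than max_len seconds, never splitting an utterance.
    A single utterance longer than max_len still ends up alone in an
    oversized clip and would have to be discarded."""
    groups = []
    current = []  # utterances in the clip being built
    for start, end, text in utterances:
        if current and end - current[0][0] > max_len:
            groups.append(current)
            current = []
        current.append((start, end, text))
    if current:
        groups.append(current)
    return [(g[0][0], g[-1][1], " ".join(u[2] for u in g)) for g in groups]

# Hypothetical alignment of a long recording (times in seconds)
aligned = [(0.0, 6.0, "hello"), (6.5, 12.0, "world"),
           (12.5, 20.0, "good"), (20.5, 25.0, "morning")]
clips = chunk_utterances(aligned)
print(clips)  # two clips, each under 15 s
```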

I have one last question; it may be dumb, but I have to know, and maybe you or @josh_meyer can help me.

If I use TL, retraining only the last layer from the v0.5.1 checkpoint, then stop, test my model, and restart retraining only the last layer, which checkpoints should I use? Do I have to restart from the v0.5.1 checkpoint?
Or can I start from my retrained checkpoint? If so, should I remove the --drop_source_layers 2 flag?

Hmmm… What’s my solution then? Find another dataset, or try to cut this one into 10-15-second pieces, at the risk of cutting sentences in two?

Is it possible with 35 s? I have another dataset with the same sentence pronounced by many differently-accented people, but the sentence is quite long :confused:

Continuing the discussion from Non native english with transfer learning from V0.5.1 Model, right branch, method and discussion:

Edited the title, as the discussion is about more than just the choice of the right branch for transfer learning.

I guess with VAD and forced alignment you should get something. @nicolaspanel contributed ~182 h of LibriVox as TrainingSpeech this way.
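To show the VAD idea, here is a toy energy-based detector; a real pipeline would use something like webrtcvad plus forced alignment, and the threshold and frame size below are arbitrary:

```python
import math

def simple_vad(samples, sample_rate=16000, frame_ms=30, threshold=0.02):
    """Toy energy-based VAD: mark each frame as speech if its RMS
    amplitude exceeds a fixed threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms > threshold)
    return flags

# Synthetic signal: 2 silent frames then 2 "loud" frames (values made up)
sr, fl = 16000, 480
sig = [0.0] * (2 * fl) + [0.1] * (2 * fl)
print(simple_vad(sig, sr))  # [False, False, True, True]
```

Runs of speech frames found this way give you the cut points for carving a 1h30 recording into short clips.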

The batch size might still be too low, even though it fits into GPU RAM.

Honestly, I have not tested this branch in a long time.

Hello @caucheteux

Here are some insights from my Spanish model that may be useful to you.

FYI, here’s the branch used for the tests; it’s just a few days behind the current master of DeepSpeech: https://github.com/carlfm01/DeepSpeech/tree/layers-testing

To use this branch, you will need to add the following params:

--fine_tune Whether to fine-tune the transferred layers from the source model or not

--drop_source_layers A single integer for how many layers to drop from the source model (to drop just the output layer == 1, drop the penultimate and output layers == 2, etc.)

--source_model_checkpoint_dir The path to the trained model; it will load all the layers and drop the specified ones

Things you can’t do with the current branch: fine-tune specific layers, drop specific layers, or freeze specific layers.
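Putting the three flags together, a training invocation on that branch might look like the sketch below. All paths are placeholders, and the surrounding flags (--train_files, --learning_rate, etc.) are the standard mainline trainer flags, not anything confirmed in this thread; check the branch's --helpfull output for the exact set:

```shell
# Hypothetical paths; drop only the output layer and fine-tune the rest
python -u DeepSpeech.py \
  --train_files /data/cv-en/train.csv \
  --dev_files /data/cv-en/dev.csv \
  --test_files /data/cv-en/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --source_model_checkpoint_dir /checkpoints/deepspeech-0.5.1 \
  --drop_source_layers 1 \
  --fine_tune True \
  --learning_rate 0.000001 \
  --checkpoint_dir /checkpoints/my-transfer-run
```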

For the following results I only dropped the last layer and fine-tuned the others.

Total hours | LR       | Dropout | Epochs | Mode              | Batch size | Test set              | WER
500         | 0.00012  | 0.24    | 1      | Transfer learning | 12         | Train-es-common voice | 27%
500         | 0.000001 | 0.11    | 2      | Transfer learning | 10         | Train-es-common voice | 46%
500         | 0.0001   | 0.22    | 6      | From scratch      | 24         | Train-es-common voice | 50%

For 500 h, one epoch seems to be enough when dropping the last layer and fine-tuning the other ones.

As @lissyx mentioned, I think your way to go is to just fine-tune the existing model with your data, using a very low LR like 0.000001.

The transfer-learning approach, I feel, is meant to solve the issue of different alphabets.


Thanks, Carlos!

It helps a lot! I’ll test some of your configs.

Here you talk about 500 hours of data for training, and then you test on Common Voice data? Just to be sure I understand correctly :slight_smile:

Thanks again; I’ll keep you posted.

Yes, 500 h for training, then testing on the Common Voice Spanish set. Almost all my training set is from the same domain, so I’d rather use the Common Voice set to avoid biased results.