Fine-tuning DeepSpeech Model (CommonVoice-DATA)


Here are my last-try parameters:

python -u DeepSpeech.py \
--fine_tune True \
--export_dir .../MyModel_TL2/ \
--summary_dir ../summaries/TL2/ \
--checkpoint_dir ../deepspeech-checkpoint \
--train_files ../data/en/clips/train.csv \
--dev_files ../data/en/clips/dev.csv \
--test_files ../data/en/clips/test.csv \
--lm_binary_path ../deepspeech_models_0.5.1/lm.binary \
--lm_trie_path ../deepspeech_models_0.5.1/trie \
--alphabet_config_path ../deepspeech_models_0.5.1/alphabet.txt \
--load best \
--drop_source_layers 1 \
--source_model_checkpoint_dir ../deepspeech-0.5.1-checkpoint/ \
--n_hidden 2048 \
--early_stop False \
--train_batch_size 32 \
--dev_batch_size 32 \
--test_batch_size 32 \
--learning_rate 0.00005 \
--dropout_rate 0.15 \
--epochs 30 \
--validation_step 1 \
--report_count 20 \
--es_std_th 0.1 \
--es_mean_th 0.1 \
--es_steps 20

I train on CV and also test on CV. The WER is 47.12%.

I noticed that some parameters don't seem to work; did I write them wrong?

Those will have an impact on the training, and you will likely have to do your own search for the best fit with your data/goal.

I am not familiar with a few of your flags. Anyway, like I said, the only obvious difference I see is the batch size (mine is 64); the rest I've already posted here. For training from scratch my learning rate is 0.0001. When you see overfitting, cut it down.

Why drop layers? If you are targeting the same language the model was trained on, you are basically "losing knowledge" from the previous training.

It looks like your model will overfit. I'm doing transfer learning with 500 h, and even when dropping 2 layers, 1 epoch is enough; with more epochs the WER increased.

Try with

--epochs 1


Thanks @carlfm01 and @alchemi5t,

It seems like I got mixed up between TL to another language and TL for the same language.

I'll try to clean up my parameters and see what is useful and what is not…

I don't think the batch size has an impact on precision, but rather on training time.

This is not accurate. Check this out:

After one and a half years, I come back to my answer because my previous answer was wrong.
Batch size impacts learning significantly. What happens when you put a batch through your network is that you average the gradients. The concept is that if your batch size is big enough, this will provide a stable enough estimate of what the gradient of the full dataset would be. By taking samples from your dataset, you estimate the gradient while reducing computational cost significantly. The lower you go, the less accurate your estimate will be; however, in some cases these noisy gradients can actually help escape local minima. When it is too low, your network weights can just jump around if your data is noisy, and it might be unable to learn or it converges very slowly, thus negatively impacting total computation time.
Another advantage of batching is for GPU computation, GPUs are very good at parallelizing the calculations that happen in neural networks if part of the computation is the same (for example, repeated matrix multiplication over the same weight matrix of your network). This means that a batch size of 16 will take less than twice the amount of a batch size of 8.
In the case that you do need bigger batch sizes but it will not fit on your GPU, you can feed a small batch, save the gradient estimates and feed one or more batches, and then do a weight update. This way you get a more stable gradient because you increased your virtual batch size.
WRONG, OLD ANSWER: [[[No, the batch_size on average only influences the speed of your learning, not the quality of learning. The batch_sizes also don't need to be powers of 2, although I understand that certain packages only allow powers of 2. You should try to get your batch_size the highest you can that still fits the memory of your GPU to get the maximum speed possible.]]]

credit:-
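To make the "noisier estimate at smaller batch sizes" point concrete, here is a small NumPy sketch (my own illustration, not from the quoted answer) that measures the spread of batch-averaged gradient estimates for a toy least-squares problem at a few batch sizes; the spread shrinks as the batch grows.

# Illustration only: how batch size affects the noise of the gradient
# estimate, using a toy least-squares problem y = X w + noise.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

w = np.zeros(d)  # current (untrained) weights

def batch_gradient(batch_size):
    """Gradient of the MSE loss estimated from one random mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

for bs in (4, 32, 256):
    grads = np.stack([batch_gradient(bs) for _ in range(500)])
    # The spread of the estimates around their mean shrinks as batch size grows.
    print(f"batch size {bs:4d}: std of gradient estimate = {grads.std(axis=0).mean():.3f}")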

Theoretically, you’re only right when every sample conforms to a strict set of rules and the network is guaranteed to converge to that function, which is (sweepingly) never the case.

Correct me if I am wrong in believing this is accurate.

That is interesting. I want to know how TL to another language behaves. Because if you don't get stuck in any local minima (unlikely), you would still converge, except maybe faster/slower. Any observations?

Hey @alchemi5t,

Thanks for your answer, it helps me understand batches better.
Well, it was me who was wrong; I didn't see it that way. I was thinking along the lines of the old, wrong answer in your quote :stuck_out_tongue: What your quote says is logical: a bigger batch size helps avoid falling into a local minimum.

I'll try to go with batch size = 64; I don't know if my machine will handle it, though.

I don't get this part. What does that mean? How do you do that?

EDIT: That's what I feared, my machine can't handle batch size 64… All the more reason to understand what the quote above means.

Here is what I get using LR 0.0001, batch size 32, 1 epoch, 0 dropped layers (fine-tuning), dropout rate 0.25, trained on CV, tested on CV:

WER = 52.96%, CER = 34.57%, test_loss = 50.90
training_loss = 44.15
validation_loss = 44.32

I'm sure I'm missing something here, but I really don't see what…

You have to write code and change the pipeline for that. Instead of updating the weights immediately, you store the gradients, compute gradients for however many more batches you need, average them, and only then update the weights. For example, if your maximum possible batch size is 32, get the gradients for the first batch, then, instead of immediately updating the weights, also get the second batch's gradients, average both, and then update the weights to effectively get a batch size of 64.

Edit: I didn't see that you had already understood what that meant.
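For what it's worth, here is a minimal sketch of that idea (gradient accumulation, sometimes called a "virtual" batch size) in TensorFlow 1.x, the version DeepSpeech 0.5.x is built on. The toy linear model and random data below are placeholders standing in for the real acoustic model and pipeline, not DeepSpeech's actual training graph:

# Sketch of gradient accumulation: run N small batches, average their
# gradients, then do a single weight update (effective batch = N * batch size).
import numpy as np
import tensorflow as tf

# Toy model standing in for the real acoustic model.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
tvars = tf.trainable_variables()

# One non-trainable accumulator per trainable variable.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
grads_and_vars = optimizer.compute_gradients(loss, tvars)

accum_steps = 2  # e.g. 2 batches of 32 -> effective batch size 64
accum_op = [a.assign_add(g) for a, (g, _) in zip(accum, grads_and_vars)]
apply_op = optimizer.apply_gradients(
    [(a / accum_steps, v) for a, v in zip(accum, tvars)])
zero_op = [a.assign(tf.zeros_like(a)) for a in accum]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        # Random data standing in for one real mini-batch of size 32.
        xb = np.random.randn(32, 10).astype("float32")
        yb = np.random.randn(32, 1).astype("float32")
        sess.run(accum_op, feed_dict={x: xb, y: yb})
        if (step + 1) % accum_steps == 0:
            sess.run(apply_op)  # one weight update per accum_steps batches
            sess.run(zero_op)   # reset the accumulators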

If I understand it correctly (and there is a good chance I'm missing something or giving a bad explanation), you have to drop the last layer (in order to make it sensitive to your language while still reusing the work on the old language). You then train the last layer with your language's dataset and adapt the LM.

You can also merge 2 acoustic models (I know the theory, but in practice I don't know how to do it), but that is more for a non-native-accent-oriented model.

Well, it was not what I expected haha; I'm reluctant to modify the code, I don't want to introduce bugs into it.
Even if I do this, I think there is more to investigate, because I'm at 52% and I've seen people get down to 22% with CV.
I still don't get why my loss never goes lower than 30 while yours is lower than 5…

I have a few reservations. I’ll make a new thread to discuss this. Thank you for your inputs!

Ask @lissyx, he works on TL for the French language if I remember well; he may explain it better than me.

The only transfer learning I'm doing for French is not really the same transfer learning as discussed above.

BTW, you mention not being able to get the loss below 30. Is this with CV FR? I'm pretty sure I already shared links to the GitHub issues with you: the data inside Common Voice needs some love, and I have not had the time to do that. And so far, nobody has cared.

Yeah, I remember the various issues with the French CV dataset; that's why I use the English CV for now. See it as a POC, to get used to TL, DeepSpeech and ASR.
I contributed a few sentences to CV, but same as you, I don't have a lot of time to do it…

The Common Voice dataset contains clips with errors. I’m working on building up a list of the offending clips so they can be put in the invalid clip CSV, but in the meantime, if you see a transcript that is wildly different from what it’s supposed to say, you can look up the transcript in the test CSV to get the original filename, then play it back and see if it’s correct. If not, just remove it from the CSV.
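In case it saves someone a few minutes, here is a small pandas sketch of that lookup-and-remove step. It assumes the usual DeepSpeech CSV layout with wav_filename and transcript columns; the search phrase and the bad file name are hypothetical placeholders, and the paths follow the command earlier in the thread:

# Look up a suspicious transcript in the test CSV, print the matching clips
# so they can be played back, then drop the confirmed bad rows.
import pandas as pd

csv_path = "../data/en/clips/test.csv"
df = pd.read_csv(csv_path)

# Find the clip(s) whose transcript contains the text from the test report.
suspect = df[df["transcript"].str.contains("some suspicious phrase", case=False, na=False)]
print(suspect[["wav_filename", "transcript"]])

# After listening: remove the rows you confirmed are wrong and save a cleaned CSV.
bad_files = ["common_voice_en_123456.wav"]  # hypothetical file names
cleaned = df[~df["wav_filename"].isin(bad_files)]
cleaned.to_csv("../data/en/clips/test_cleaned.csv", index=False)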

I was able to improve the WER by a few points just by removing a few of the worst offenders from the test set.

Good idea, I'll do that! Every tip is good to take :slight_smile:

Not sure if it’ll help for my loss issue though…

Thanks a lot @dabinat !

Hello there, where did you get the pre-trained model for the Common Voice data?
Can you please share the link?

Hi @Rosni07 welcome to the forum :slight_smile:

You can find pre-trained models on the releases page: https://github.com/mozilla/DeepSpeech/releases

Scroll down until you find one that isn't an -alpha release; it should contain links to the pre-trained models and the other downloads/details you'll need.

BTW, this part of the forum (#deep-speech) is associated with that repo, so it's generally worth having a look through the repo before posting about simple stuff like this.

I already checked the releases page, but the recently released model was trained on American English, and I was wondering if there is one trained on the Common Voice dataset?