Transfer learning by training only the last few layers

(Rpratesh) #1


I want to continue training the provided DeepSpeech 0.3 checkpoints with my own recorded audio files.
For now, I want to train only the last few layers (such as the fully connected layers and the last few RNN layers) while freezing the rest. What changes are needed in the code to do that?

Also, some advice: is it better to train only the last few layers or the entire network, given that my dataset is just another accent of English with a few extra words that were not previously in the vocabulary?
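Freezing a layer simply means leaving its parameters out of the optimizer update. In TensorFlow 1.x, which DeepSpeech 0.3 is built on, that is usually done by passing only the trainable variables to `Optimizer.minimize` through its `var_list` argument. The framework-free sketch below shows the idea with plain Python floats; the layer names and the one-weight-per-layer model are hypothetical.

```python
from typing import Dict, Set


def sgd_step(params: Dict[str, float],
             grads: Dict[str, float],
             trainable: Set[str],
             lr: float = 0.1) -> Dict[str, float]:
    """Return updated parameters; names not in `trainable` stay frozen."""
    updated = {}
    for name, value in params.items():
        if name in trainable:
            # Trainable parameter: ordinary gradient-descent update.
            updated[name] = value - lr * grads[name]
        else:
            # Frozen parameter: copied through unchanged.
            updated[name] = value
    return updated


# Hypothetical model: one scalar weight per layer, ordered input -> output.
params = {"fc1": 1.0, "fc2": 1.0, "fc3": 1.0, "rnn": 1.0, "fc5": 1.0, "out": 1.0}
grads = {name: 0.5 for name in params}

# Train only the last two layers; everything earlier stays frozen.
new_params = sgd_step(params, grads, trainable={"fc5", "out"})
print(new_params)
```

The same selection applies to a real graph: build the list of variables belonging to the layers you want to train and hand only that list to the optimizer.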


(Rpratesh) #2

Any suggestions on this, @lissyx @Tilman_Kamp?

(Lissyx) #3

Joshua is already working on that in the transfer-learning branch, but it’s still early.

(Rpratesh) #4

@lissyx, can you tag Joshua’s handle in this thread, so that we can discuss this topic here?

(Josh Meyer) #5

Hey there, @rpratesh!

Try checking out the branch @lissyx mentioned (transfer-learning).

You will need to add the following files for your new accent:


(1) Use util/ to find the unique characters in your new accent transcripts.

(2) Use util/ from the transfer-learning branch to generate the trie and lm.binary files.

(3) Train with, making sure the paths match your local data. I’d try replacing the top layer and maybe the second layer. Going any deeper probably won’t make sense for accent transfer alone.
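Step (1) amounts to collecting the set of characters that appear in the transcripts. A minimal sketch of that idea, assuming the usual `wav_filename,wav_filesize,transcript` CSV layout and a one-character-per-line alphabet file; it is not the util/ script referenced above, and the function names and sample rows are hypothetical.

```python
import csv
import io
from typing import Iterable, List


def unique_characters(csv_files: Iterable[io.TextIOBase]) -> List[str]:
    """Collect the sorted set of characters used in the `transcript` column."""
    chars = set()
    for handle in csv_files:
        for row in csv.DictReader(handle):
            chars.update(row["transcript"])
    return sorted(chars)


def alphabet_text(chars: List[str]) -> str:
    """One character per line; a space becomes a line holding just ' '."""
    return "".join(ch + "\n" for ch in chars)


# Hypothetical transcripts for the new accent.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,namaste ji\n"
    "b.wav,2000,chalo\n"
)
chars = unique_characters([sample])
print(chars)
print(alphabet_text(chars), end="")
```

If the resulting set differs from the alphabet the checkpoint was trained with, the size of the output layer changes as well, which is one reason the top layer has to be replaced rather than reused.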

(Rpratesh) #6

I’ve checked out the “transfer-learning” branch, generated my own lm.binary and trie files, and trained with two extra parameters:

--drop_source_layers 2 --source_model_checkpoint_dir "$MyCkptDir"

Are these two parameters correct, and are they enough for transfer learning with only the last two layers unfrozen?

The model seems to have converged quickly and performs well on my test dataset. However, I suspect it has overfitted: it works well on sentences similar to the ones I added to the vocabulary (even from a completely new speaker), but not when the same speaker says entirely new sentences.
Any suggestions on how to improve this?
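A conceptual sketch of what `--drop_source_layers N` is meant to do with the weights loaded from `--source_model_checkpoint_dir`: every layer is restored except the last N, which start from a fresh initialization and are learned from the new data. This is not the branch’s actual code; the checkpoint-as-dict model and the layer_1 to layer_6 names are hypothetical, and whether the restored layers are afterwards frozen or fine-tuned is a separate matter this sketch does not model.

```python
import random
from typing import Dict, List


def restore_with_dropped_layers(checkpoint: Dict[str, List[float]],
                                layer_order: List[str],
                                drop_last: int,
                                seed: int = 0) -> Dict[str, List[float]]:
    """Copy every layer from `checkpoint` except the last `drop_last` ones
    (counted from the output end), which are re-initialized at random."""
    if not 0 <= drop_last <= len(layer_order):
        raise ValueError("drop_last must be between 0 and the number of layers")
    rng = random.Random(seed)
    dropped = set(layer_order[len(layer_order) - drop_last:])
    params: Dict[str, List[float]] = {}
    for name in layer_order:
        if name in dropped:
            # Fresh weights: learned only from the new (accented) data.
            params[name] = [rng.uniform(-0.1, 0.1) for _ in checkpoint[name]]
        else:
            # Transferred weights: copied as-is from the source checkpoint.
            params[name] = list(checkpoint[name])
    return params


# Hypothetical six-layer model, input to output; the checkpoint is a plain dict.
layers = ["layer_1", "layer_2", "layer_3", "layer_4", "layer_5", "layer_6"]
source = {name: [1.0, 1.0] for name in layers}

params = restore_with_dropped_layers(source, layers, drop_last=2)
kept = [name for name in layers if params[name] == source[name]]
print(kept)  # layers restored from the source checkpoint
```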


Try early stopping with a shorter window.
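A generic patience-style early-stopping check, sketched to show why a shorter window stops training sooner on a small dataset. It is an illustration under assumptions, not the criterion DeepSpeech.py implements; `should_stop`, `window`, `min_delta`, and the loss history are hypothetical.

```python
from typing import List


def should_stop(val_losses: List[float], window: int, min_delta: float = 0.0) -> bool:
    """Stop when the last `window` validation losses bring no improvement
    (by at least `min_delta`) over the best loss seen before them."""
    if len(val_losses) <= window:
        return False  # not enough history yet
    best_before = min(val_losses[:-window])
    recent_best = min(val_losses[-window:])
    return recent_best > best_before - min_delta


# Hypothetical validation losses, one per epoch.
history = [3.0, 2.1, 1.7, 1.6, 1.65, 1.7, 1.8]
for epoch in range(1, len(history) + 1):
    if should_stop(history[:epoch], window=2):
        print(f"stop after epoch {epoch}")
        break
```

With `window=2` this loop stops after epoch 6; with `window=3` the same history runs one more epoch before stopping, so a shorter window cuts training off closer to the best validation loss.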

(Rpratesh) #8

Hi @josh_meyer,

If we look at DeepSpeech’s architecture, it has 3 FC layers, then one BiLSTM layer, followed by one FC layer.
So, if I train the last two layers using the transfer-learning branch that @josh_meyer mentioned, would that also modify the weights of the BiLSTM layer (since it is the second-to-last layer in this architecture)?

Also, @josh_meyer @lissyx @kdavis,
can you suggest a better training method, so that the model retains the American accent but also performs well on other accents?
The problem I’m facing is that if I train the last one or two layers on Indian-accented data alone, the model forgets the previous accents.

(Rpratesh) #9

Any help with this question?

(Reuben Morais) #10

The terminology in the transfer learning branch includes the output layer in the count, so last two means the output layer and final hidden layer.
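Counting that way over the stack described in this thread (three fully connected layers, the recurrent layer, one more fully connected hidden layer, then the output layer) shows which layers "the last N" covers. The layer names below are hypothetical labels, not the variable names used in the code.

```python
# Hypothetical labels for the layer stack, ordered input -> output.
LAYERS = ["fc_1", "fc_2", "fc_3", "lstm", "fc_5", "output"]


def last_n_layers(n: int) -> list:
    """Layers covered by 'the last n', with the output layer counted first."""
    if not 0 <= n <= len(LAYERS):
        raise ValueError(f"n must be between 0 and {len(LAYERS)}")
    return LAYERS[len(LAYERS) - n:]


print(last_n_layers(2))           # final hidden layer and output layer
print("lstm" in last_n_layers(2))  # recurrent layer not included
print("lstm" in last_n_layers(3))  # included only from three layers up
```

So with a count of two, the recurrent layer is not among the replaced layers; it would only be included with a count of three.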

(Rpratesh) #11

Does “output layer” mean the CTC decoding layer?

(Reuben Morais) #12

It means the last layer in the model, the one that has the same size as your alphabet.
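To make the distinction concrete: the output layer emits, for every audio frame, one score per symbol, and turning that sequence of scores into text is a separate decoding step with no trainable weights. In CTC models there is typically one extra class for the blank label; this sketch assumes it sits at the last index (the default in TensorFlow’s `tf.nn.ctc_loss`). The tiny alphabet and the scores are hypothetical, and greedy decoding is only the simplest case: DeepSpeech’s own decoder is a beam search that also uses the language model (the lm.binary and trie files mentioned earlier).

```python
from typing import List, Sequence

ALPHABET = [" ", "a", "b", "c"]  # hypothetical tiny alphabet
BLANK = len(ALPHABET)            # extra CTC blank class at the last index


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]]) -> str:
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars: List[str] = []
    prev = None
    for idx in best:
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)


# One row per frame, one column per class (len(ALPHABET) + 1 columns).
scores = [
    [0.1, 0.7, 0.1, 0.0, 0.1],  # 'a'
    [0.1, 0.6, 0.1, 0.0, 0.2],  # 'a' again, merged with the previous frame
    [0.0, 0.1, 0.1, 0.0, 0.8],  # blank
    [0.1, 0.2, 0.6, 0.0, 0.1],  # 'b'
    [0.0, 0.1, 0.1, 0.7, 0.1],  # 'c'
]
print(greedy_ctc_decode(scores))
```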