Model inference is bad after fine-tuning

Hello,

I fine-tuned the DeepSpeech English model with my custom data (5-7 second clips, about 3 hours of audio in total, most of it machine-generated through data augmentation).
First I trained for 3 epochs (lr = 0.0001, batch_size = 16, without dropout).
The loss dropped to about 3, but when I tested the model the results were really bad, much worse than before fine-tuning. I fine-tuned again for 10 epochs with the same hyperparameters and the loss went down to 0.8, but the predictions were still really, really bad… I also want to mention that I did not get any warnings during fine-tuning. Overfitting?
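
For reference, the run above corresponds to roughly the following invocation (a minimal sketch, assuming the stock DeepSpeech.py training script from the v0.8 repo; the checkpoint and CSV paths are placeholders):

```python
import subprocess

# Rough reconstruction of the fine-tuning run described above.
# All paths are placeholders -- point them at your own checkout,
# the released v0.8.2 checkpoint, and your train/dev CSVs.
subprocess.run([
    "python3", "DeepSpeech.py",
    "--load_checkpoint_dir", "deepspeech-0.8.2-checkpoint",
    "--save_checkpoint_dir", "finetune-checkpoint",
    "--train_files", "train.csv",
    "--dev_files", "dev.csv",
    "--learning_rate", "0.0001",
    "--train_batch_size", "16",
    "--epochs", "3",
    "--dropout_rate", "0.0",   # "without dropout"
    "--n_hidden", "2048",      # must match the released model's geometry
], check=True)
```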

I did the same experiment with German data and the transcription improved after fine-tuning, even though I used a smaller data set. The loss was about 60.

Could someone give me advice or tell me what could be going wrong? Should I use dropout = 0.4?

Not much, but you don’t state what you use as the basis …

Dropout is always on your side.

Did you search for learning rates in fine tuning or transfer learning?

That is your job.

You don’t even tell us which language you are fine-tuning, so it is hard to help …
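
For the learning-rate search, a crude sweep along these lines would do (again a sketch with placeholder paths, assuming the stock DeepSpeech.py flags):

```python
import subprocess

# Hypothetical learning-rate sweep: fine-tune from the released checkpoint
# once per candidate rate, then compare the reported dev/test WER.
for lr in ["1e-5", "5e-5", "1e-4"]:
    subprocess.run([
        "python3", "DeepSpeech.py",
        "--load_checkpoint_dir", "deepspeech-0.8.2-checkpoint",
        "--save_checkpoint_dir", f"finetune-lr-{lr}",
        "--train_files", "train.csv",
        "--dev_files", "dev.csv",
        "--test_files", "test.csv",
        "--learning_rate", lr,
        "--train_batch_size", "16",
        "--epochs", "3",
        "--n_hidden", "2048",
    ], check=True)
```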

  • Fine-tuning
  • Mozilla STT v0.8.2 English model
  • Linux Ubuntu
  • Python 3.6
  • tensorflow_gpu = 1.14.0
  • CUDA 10.0/cuDNN 7.5

I am not sure about the amount of data the model was trained with, but English Common Voice has about 1.5k validated hours, so I suppose it was a similar value.

What do you plan on doing with that? Transfer learning on machine-generated output with augmentation? Why not 100 hours if it is machine-generated?

Hi @othiele,

well, it is just fun for me to work with speech recognition, so I started a little speech recognition project for a specific use case.

Well, my goal is to generate as much data as possible, but until now I have only managed to generate 3 hours from the text data that I have.
I did audio synthesis and after that data augmentation using audio manipulation techniques such as increasing pitch, changing speed, adding noise, etc…
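
For concreteness, the kind of manipulation I mean looks roughly like this (a sketch using librosa and soundfile; the file names and parameter values are just examples):

```python
import numpy as np
import librosa
import soundfile as sf

# Load a synthesized clip (16 kHz mono, as DeepSpeech expects).
y, sr = librosa.load("synth_clip.wav", sr=16000)

# Shift the pitch up by two semitones.
pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Speed up by 10% (time stretch without changing pitch).
faster = librosa.effects.time_stretch(y, rate=1.1)

# Add a little white noise.
noisy = y + 0.005 * np.random.randn(len(y))

for name, clip in [("pitch", pitched), ("speed", faster), ("noise", noisy)]:
    sf.write(f"synth_clip_{name}.wav", clip, sr)
```
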
@othiele what is your approach for generating the audio?

You take real audio from people and then apply augmentations to it. You could use the same audio with different augmentations if you don’t have a lot of material. But 3 hours won’t make much of a difference either way; this is simply not enough data.
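
If material is scarce, one option is to write several randomized variants of each real recording, along these lines (a sketch; the parameter ranges are arbitrary examples):

```python
import random
import numpy as np
import librosa
import soundfile as sf

def make_variants(path, n_variants=5):
    """Write n_variants randomly augmented copies of one real recording."""
    y, sr = librosa.load(path, sr=16000)
    for i in range(n_variants):
        # Random pitch shift, speed change, and noise level per variant.
        out = librosa.effects.pitch_shift(y, sr=sr, n_steps=random.uniform(-2, 2))
        out = librosa.effects.time_stretch(out, rate=random.uniform(0.9, 1.1))
        out = out + random.uniform(0.0, 0.01) * np.random.randn(len(out))
        sf.write(path.replace(".wav", f"_aug{i}.wav"), out, sr)

make_variants("real_clip.wav")
```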