Hello,
I fine-tuned the DeepSpeech English model with my custom data (5–7 second audio clips, about 3 hours of audio in total, most of it machine-generated through data augmentation).
First I trained for 3 epochs (lr = 0.0001, batch_size = 16, no dropout).
The loss kept decreasing down to a value of about 3. Then I tested the model and the result was really bad, much worse than before fine-tuning. I tried fine-tuning again for 10 epochs with the same hyperparameters and the loss went down to 0.8, but the predictions were still really, really bad… I also want to mention that I did not get any warnings during fine-tuning. Overfitting?
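For context, a fine-tuning run with these hyperparameters looks roughly like the sketch below. This is an illustration based on the DeepSpeech training script, not my exact command: the file paths are placeholders, and flag names may differ between DeepSpeech releases.

```shell
# Sketch of a DeepSpeech fine-tuning invocation with the hyperparameters above.
# Paths (my_data/..., deepspeech-checkpoint/) are placeholders.
python3 DeepSpeech.py \
  --train_files my_data/train.csv \
  --dev_files my_data/dev.csv \
  --test_files my_data/test.csv \
  --checkpoint_dir deepspeech-checkpoint/ \
  --epochs 3 \
  --learning_rate 0.0001 \
  --train_batch_size 16 \
  --dropout_rate 0
```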
I ran the same experiment with German data and the transcription improved after fine-tuning, even though I used a smaller data set. The loss there was about 60.
Could someone give me some advice or tell me what could be going wrong? Should I use dropout = 0.4?