Continuing training saving new 'best' checkpoint


I am training a model on an Urdu dataset in the native script. Using transfer learning from the English pretrained model, I reached a loss of 36.299953 after 80 epochs on this data. I want to improve on this further by adjusting the parameters and applying some of the augmentations available in DeepSpeech.

The one big question I have is this: if we are “continuing” training, why is a new best validation checkpoint saved even when it is not better than the best one from the previous run?

The other question is what techniques we can use to decrease the loss. This is the command I am using to “continue” training:

--drop_source_layers 2
--alphabet_config_path /$HOME/Uploads/UrduAlphabet_newscrawl2.txt
--load_checkpoint_dir /$HOME/DeepSpeech/dataset/trained_load_checkpoint
--save_checkpoint_dir /$HOME/DeepSpeech/dataset/trained_load_checkpoint
--train_files /$HOME/Uploads/trains55final.csv
--dev_files /$HOME/Uploads/devs55final.csv
--epochs 30
--train_batch_size 32
--export_dir /$HOME/DeepSpeech/dataset/urdu_trained
--export_file_name urdu
--learning_rate 0.00001
--scorer /$HOME/Uploads/kenlmnew.scorer
--n_hidden 2048
--dropout_rate 0.2
--train_cudnn true

I now want to adjust the parameters to continue and try to improve the loss.

How much difference would one form of augmentation alone make to our data? Or would it be more useful to use multiple augmentations together in the same run?

I know you can’t “think” for me but I am looking for a pointer to try and improve this. Will running the same data set (around 60 hours) produce better loss with different augmentation combinations?

The WER at 80 epochs is around 58% with a loss of 36.3. The training loss is at 32. Both continue to decrease so I know it is not overfitting and continuing training will reduce this a bit.

On other data sets, the training loss keeps decreasing while the validation loss starts to increase. Based on other forum questions, that is overfitting. Is my understanding correct?

Check your checkpoint dir. It has a text file that states what the last saved and best model are.
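A quick way to inspect those files is a small script like the sketch below. It assumes the state files are named `checkpoint` and `best_dev_checkpoint` and use TensorFlow's plain-text format with a `model_checkpoint_path: "..."` line; check the names in your own directory, as both are assumptions here. The helper names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: state files contain a line such as
#   model_checkpoint_path: "best_dev-12345"
_STATE_LINE = re.compile(r'^model_checkpoint_path:\s*"(.+)"\s*$', re.MULTILINE)


def parse_checkpoint_state(text):
    """Return the checkpoint name recorded in a state file's text, or None."""
    match = _STATE_LINE.search(text)
    return match.group(1) if match else None


def summarize_checkpoint_dir(checkpoint_dir):
    """Map each state file found in the directory to the checkpoint it names."""
    summary = {}
    for name in ("checkpoint", "best_dev_checkpoint"):  # assumed file names
        path = Path(checkpoint_dir) / name
        if path.is_file():
            summary[name] = parse_checkpoint_state(path.read_text())
    return summary
```

Calling `summarize_checkpoint_dir(...)` on the directory passed as the load and save checkpoint dir, before and after a run, shows whether the "best" entry changed during that run.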

More material, maybe augmentation.

Augmentation is quite new, search the forum and experiment.

Yes, this is typical for fresh trainings. Look at the es_epochs flag.
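To make the idea concrete, here is a minimal sketch of an early-stopping check on a list of per-epoch validation losses. It is not DeepSpeech's own implementation; the function name and the `patience` and `min_delta` parameters are hypothetical and only illustrate the kind of rule such a flag controls.

```python
def should_stop_early(dev_losses, patience=5, min_delta=0.0):
    """Return True when none of the last `patience` epochs improved on the
    best earlier validation loss by more than `min_delta`."""
    if len(dev_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(dev_losses[:-patience])
    recent_best = min(dev_losses[-patience:])
    return recent_best > best_before - min_delta
```

With a validation curve that bottoms out and then rises while training loss keeps falling, the check starts returning True once the rise has lasted `patience` epochs, which is the overfitting pattern described above.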

And maybe use reduce_lr_on_plateau. In general, 60 hours might not be enough to reach a WER of 15 through transfer learning. dan.bmh, for example, uses 2000 hours to transfer.
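As a rough illustration of what reducing the learning rate on a plateau means, the sketch below simulates the schedule for a given list of validation losses. It is a generic version of the technique, not the DeepSpeech code; the function name and the `plateau_epochs` and `reduction` parameters are illustrative assumptions.

```python
def lr_schedule_on_plateau(dev_losses, initial_lr, plateau_epochs=3, reduction=0.1):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `reduction` whenever the validation loss has
    failed to set a new best for `plateau_epochs` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0  # epochs since the last new best loss
    history = []
    for loss in dev_losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= plateau_epochs:
                lr *= reduction
                stale = 0  # start counting again at the lower rate
        history.append(lr)
    return history
```

Feeding in the validation losses logged during a run gives a feel for how often the rate would have been cut with a given setting, before spending GPU time on it.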

ah, and why do you drop 2 layers?

Because many posts referencing transfer learning have done the same. The Urdu text uses a completely new alphabet, so I am presuming that is the way to go?

Please read and understand the docs. By scraping 2 layers off the model you lose quite a lot of trained weights. The more you leave out, the more you have to retrain. You might be better off dropping just 1 layer. I usually have the same alphabet, but with as little data as you have, it might yield better results.
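To picture the trade-off, here is a small sketch that splits a layer stack into the part that keeps its pretrained weights and the part that is reinitialized. It assumes a six-layer stack and that layers are dropped from the output end; the layer names are hypothetical placeholders, so check the docs for the real ones.

```python
# Hypothetical layer names, ordered from input to output.
LAYERS = ["layer_1", "layer_2", "layer_3", "lstm", "layer_5", "layer_6"]


def split_layers(drop_source_layers, layers=LAYERS):
    """Return (kept, reinitialized) when the last N layers are dropped."""
    if not 0 <= drop_source_layers <= len(layers):
        raise ValueError("drop_source_layers must be between 0 and the layer count")
    cut = len(layers) - drop_source_layers
    return list(layers[:cut]), list(layers[cut:])


kept, reinitialized = split_layers(2)
print("kept:", kept)
print("reinitialized:", reinitialized)
```

Under these assumptions, dropping 1 layer retrains only the output layer, which has to change anyway for a new alphabet, while dropping 2 also throws away the trained weights of the layer before it.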

But in the end you’ll have to use more data. Think of 600 instead of 60 hours.