Has anyone successfully fine-tuned a DeepSpeech model?

I’ve been running various fine-tuning tests on the 0.1.1 release, training on single voices to see if I could improve the model’s performance. I train with the default command:

python3 DeepSpeech.py --n_hidden 2048 --initialize_from_frozen_model path/to/model/output_graph.pb --checkpoint_dir fine_tuning_checkpoints --epoch 3 --train_files my-train.csv --dev_files my-dev.csv --test_files my-dev.csv --learning_rate 0.0001
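For reference, the train/dev CSVs are in the standard DeepSpeech layout of wav_filename, wav_filesize (bytes) and transcript; the paths, sizes, and transcripts below are just placeholders:

wav_filename,wav_filesize,transcript
/data/voice/chunk_0001.wav,320044,this is the first utterance
/data/voice/chunk_0002.wav,281644,and this is the second one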

I trained on 1 hour of audio for 1-3 epochs, and the model never seems to improve. I’m wondering if I’m doing something wrong or the code is not working. Has anyone else here successfully trained a better-performing model by fine-tuning on top of DeepSpeech?

I don’t know whether your audio is a single file or chunked into multiple parts, but fine-tuning with short audio segments worked for me.

It is trained on 10-second chunks. How do you know it works well? And what did you train it for: a new language or a new voice? How did you measure the improvement? What learning rate did you use, and how many epochs? Would appreciate any information you can share, thank you!
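In case it helps anyone comparing runs: the usual metric here is word error rate, i.e. word-level edit distance divided by the number of reference words. A rough sketch of computing it offline (my own helper function, not code from the DeepSpeech repo):

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)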

I started from the frozen model of 0.1.1 and trained on a new-voice dataset with the default hyperparameters. It works, since it works :smiley: The only trick that comes to mind: at first I tried to train on chunks of equal length, segmenting my large audio file into pieces of 3 seconds each. That didn’t work; it seems that somehow it prefers unequal lengths.
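If you want unequal-length chunks, one simple way is to split on pauses rather than on a fixed duration. A rough sketch with pydub (the file name and thresholds are made-up values you would need to tune for your recording):

from pydub import AudioSegment
from pydub.silence import split_on_silence

# Load the long recording and cut it at pauses, so the chunk lengths vary naturally.
audio = AudioSegment.from_file("long_recording.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=500,             # a pause of at least 500 ms marks a cut point
    silence_thresh=audio.dBFS - 16,  # anything 16 dB below the average loudness counts as silence
    keep_silence=200,                # keep a little padding around each chunk
)
for i, chunk in enumerate(chunks):
    chunk.export("chunk_{:04d}.wav".format(i), format="wav")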

May I ask, how many hours of audio did you use, and how long did it take on your hardware?