Hi all.
I have to fine-tune models for various accents of Spanish, English and French. My plan is to start from the checkpoints found here https://gitlab.com/Jaco-Assistant/deepspeech-polyglot#language-models-and-checkpoints and then continue training on my own accent-specific audio. I have some questions on how to do this properly.
- When training, do I have to use audio with the same sample rate as the data used to generate the checkpoints? For example, my Spanish audio is sampled at 8 kHz, but I believe the checkpoints I am using were trained on 16 kHz audio. As a result, I get the following warning during the optimization step:
‘WARNING: sample rate of sample “…/train.wav” ( 8000 ) does not match FLAGS.audio_sample_rate. This can lead to incorrect results.’
And the transcriptions of the test files, which are also 8 kHz, are just empty strings.
I guess the way to proceed is to first convert all my .wav files to 16 kHz, but I thought DeepSpeech already supported training with different sample rates. Is that incorrect? I know that client.py has that feature, but I am not sure whether it also works for training.
- When fine-tuning, I should use the --load_evaluate last flag, right? That way testing uses the newly trained checkpoint and not the one I am using as a starting point.
- How do I know how much data I need for fine-tuning? For example, if I use only 1 data point for fine-tuning, model performance actually decreases. I guess this makes sense because the model might be overfitting to that single data point.
Thanks a lot
Code I am running:
python3 -u DeepSpeech.py \
--train_files .../train.csv \
--test_files .../test.csv \
--train_batch_size 1 \
--test_batch_size 1 \
--load_cudnn true \
--epochs 3 \
--checkpoint_dir .../DeepSpeech-Ployglot-ES-20201026T155049Z-001/checkpoint/cclmtv \
--learning_rate 0.0001 \
--alphabet_config_path ../deepspeech-polyglot/data/alphabet_es.txt \
--load_evaluate last
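
To make the first question more concrete, this is roughly what I mean by converting the 8 kHz files to 16 kHz before training. It is only a minimal sketch using the Python standard library, and it assumes 16-bit mono WAV input and an integer upsampling factor. The function names upsample_linear and resample_wav are my own, not part of DeepSpeech. Linear interpolation is a crude stand-in for a proper resampler, so a dedicated audio tool would probably give cleaner results.

```python
import array
import sys
import wave


def upsample_linear(samples, factor):
    """Upsample a sequence of ints by an integer factor via linear interpolation."""
    if not samples:
        return []
    out = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        for k in range(factor):
            # Interpolated values always stay between a and b
            out.append(a + (b - a) * k // factor)
    # No right-hand neighbour for the last sample, so hold its value
    out.extend([samples[-1]] * factor)
    return out


def resample_wav(src_path, dst_path, target_rate=16000):
    """Write a copy of src_path upsampled to target_rate (assumption: 16-bit mono)."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    if channels != 1 or width != 2:
        raise ValueError("this sketch only handles 16-bit mono WAV files")
    if target_rate % rate != 0:
        raise ValueError("this sketch only handles integer upsampling factors")

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    out = array.array("h", upsample_linear(samples, target_rate // rate))
    if sys.byteorder == "big":
        out.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(out.tobytes())
```

I would call it once per file, e.g. resample_wav("train_8k.wav", "train_16k.wav") (placeholder file names), and then point train.csv and test.csv at the converted files.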