Fine-tuning with a different sample rate than the one used to create the checkpoints

Hi all.

I have to fine-tune models for various accents of Spanish, English and French. My plan is to use the checkpoints found here and then retrain with the accent-specific audios I have. I have some questions on how to do this properly.

  1. When training, do I have to use audios with the same sample rate as the ones used to generate the checkpoints? For example, my Spanish audios have a sample rate of 8 kHz, but I believe the checkpoints I am using were generated by training with 16 kHz audios. Hence, I am receiving the following message in the optimization step:
    ‘WARNING: sample rate of sample “…/train.wav” ( 8000 ) does not match FLAGS.audio_sample_rate. This can lead to incorrect results.’
    And the transcriptions of the test files, which are also 8 kHz, are just empty strings.

I guess the way to proceed is to first convert all my .wav files to 16 kHz, but I thought DeepSpeech already supported training with different sample rates. Is that incorrect? I know that feature exists, but I am not sure whether it also works for training.

  2. When fine-tuning, I should use the --load_evaluate last flag, right? So that testing uses the newly trained checkpoints, not the ones I am using as a starting point.

  3. How do I know how much data I need for fine-tuning? For example, if I use only one data point for fine-tuning, model performance actually decreases. I guess this makes sense, since the model might be overfitting to that single data point.
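Since the warning above fires per file, a quick way to see how widespread the mismatch is would be to scan the WAV headers before training. A minimal sketch with the standard-library wave module, assuming the checkpoints expect 16 kHz (the function names are mine, not from DeepSpeech):

```python
import wave

def wav_sample_rate(path):
    # Read the sample rate straight from the WAV file's header.
    with wave.open(path, "rb") as w:
        return w.getframerate()

def find_mismatched(paths, expected=16000):
    # Return the files whose rate differs from the rate the
    # checkpoints are assumed to have been trained with.
    return [p for p in paths if wav_sample_rate(p) != expected]
```

Running this over the paths listed in train.csv would tell you up front which files need resampling.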

Thanks a lot

Code I am running:

python3 -u DeepSpeech.py \
  --train_files .../train.csv \
  --test_files .../test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --load_cudnn true \
  --epochs 3 \
  --checkpoint_dir  .../DeepSpeech-Ployglot-ES-20201026T155049Z-001/checkpoint/cclmtv \
  --learning_rate 0.0001 \
  --alphabet_config_path ../deepspeech-polyglot/data/alphabet_es.txt \
  --load_evaluate last

Ideally yes. Most people just use 16 kHz. But search this forum for “8 kHz” and you’ll find some good comments on upsampling, …
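For the upsampling itself, a minimal sketch using SciPy’s polyphase resampler (assuming 16-bit mono PCM input; the function name and the 16 kHz target are my assumptions, not part of DeepSpeech):

```python
from math import gcd

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

TARGET_RATE = 16000  # rate the checkpoints are assumed to expect

def upsample_wav(src_path, dst_path, target=TARGET_RATE):
    # Read the source WAV, resample with a polyphase filter,
    # and write the result back as 16-bit PCM at the target rate.
    rate, data = wavfile.read(src_path)
    if rate == target:
        wavfile.write(dst_path, rate, data)
        return
    g = gcd(target, rate)
    up = resample_poly(data.astype(np.float64), target // g, rate // g)
    wavfile.write(dst_path, target, np.clip(up, -32768, 32767).astype(np.int16))
```

For 8 kHz input this is an exact factor-2 upsample; sox or ffmpeg would do the same job from the command line.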

You have a really high learning rate; maybe you are changing the weights too much. Also, are you dropping a layer at all? It doesn’t look like it.

In theory no, but read the other comments. Best to stick to just one system.

No, please read up on deep learning somewhere. The last checkpoint is not necessarily the best one.

You mean a single file of a couple of seconds? That won’t lead to anything. Fine-tuning usually requires thousands of chunks, and depending on the task even more. The models were trained with millions; use tens of thousands to fine-tune.

And you have no dev set. This will lead nowhere. Study some of the examples or other code here in the forum, and start reading the docs.
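To make the dev-set point concrete, here is a sketch of the same command with a dev set wired in. The `.../dev.csv` path, the dev batch size, and the choice of `best` for evaluation are my placeholders and assumptions, not values from the original post; this also assumes the command is run from the DeepSpeech repo root so `DeepSpeech.py` resolves:

```shell
python3 -u DeepSpeech.py \
  --train_files .../train.csv \
  --dev_files .../dev.csv \
  --test_files .../test.csv \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --load_cudnn true \
  --epochs 3 \
  --checkpoint_dir .../DeepSpeech-Ployglot-ES-20201026T155049Z-001/checkpoint/cclmtv \
  --learning_rate 0.0001 \
  --alphabet_config_path ../deepspeech-polyglot/data/alphabet_es.txt \
  --load_evaluate best
```

With a dev set, checkpoints can be compared on held-out data during training, which is what makes `--load_evaluate best` meaningful.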