Continue training a model on CPU with transfer learning

My goal is to take an existing Italian-language model and, using the transfer-learning method, continue its training on other datasets of my choice, obtaining a new model as output. The starting model is the one published in the latest release of DeepSpeech-Italian-Model on GitHub; I refer to the file called ‘transfer_model_tensorflow_it.tar.xz’. To continue the training, I understood that the checkpoint files are needed; in this case they are published with the same release, in the file ‘transfer_checkpoint_it.tar.xz’ (how I unpacked both archives is sketched after the list below). The hyper-parameters declared for training the model are as follows:

  • batch_size=64
  • n_hidden=2048
  • epochs=30
  • learning_rate=0.0001
  • dropout=0.4
  • lm_alpha=0
  • lm_beta=0
  • es_epochs=10
  • early_stop=1
  • amp=0
  • drop_source_layer=1
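For completeness, this is how I unpacked the two release archives before training (a minimal sketch; the archive names come from the release, while the destination directories are just my own choice):

mkdir -p /home/pablo/deep-speech /mnt/checkpoints
# -J selects xz decompression for the .tar.xz archives;
# --strip-components=1 drops the archive's top-level folder
tar -xJf transfer_model_tensorflow_it.tar.xz -C /home/pablo/deep-speech/
tar -xJf transfer_checkpoint_it.tar.xz --strip-components=1 -C /mnt/checkpoints/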

Assuming that I need to continue training the starting model (using only the CPU, hence passing '--load_cudnn False') on a small test dataset called “cv-tiny”, I launched the following command:

python3 DeepSpeech.py \
--load_cudnn False \
--alphabet_config_path /alphabet.txt \
--checkpoint_dir /transfer_checkpoint_it \
--train_files cv-tiny/train.csv \
--dev_files cv-tiny/dev.csv \
--test_files cv-tiny/test.csv \
--scorer_path /scorer \
# hyper-parameters declared for the starting model
--train_batch_size 64 \
--dev_batch_size 64 \
--test_batch_size 64 \
--n_hidden 2048 \
--epochs 30 \
--learning_rate 0.0001 \
--dropout_rate 0.4 \
--es_epochs 10 \
--early_stop 1 \
--drop_source_layers 1 \
# file export
--export_dir /ckpt/ \
--export_file_name 'output_graph' 

Is it correct to use the same hyper-parameters as the starting model, with the exception of '--lm_alpha 0' and '--lm_beta 0'? In the transfer_flag.txt file included in the release, the values are '--lm_alpha 0.931289039105002' and '--lm_beta 1.1834137581510284'.
Is it correct to load the scorer file of the starting model this way, '--scorer_path /scorer', or is it not necessary?
Do I still have to pass '--drop_source_layers 1'?
Why, during model testing, do I get bad results (‘src’ is the original sentence, ‘res’ the final transcription) like the following:

WER: 1.500000, CER: 3.363636, loss: 367.958771
 - wav: file:///home/cv-tiny/common_voice_it_19997999.wav
 - src: "alan clarke"
 - res: "mnmnmnmnmnmnm uguaglianza bumburubumbububum"

The model has already been trained with a sufficient number of hours and should perform much better during testing. I report the whole process:

(env) root@pablo-G5-5590:/home/pablo/deep-speech/DeepSpeech-r0.9# python3 DeepSpeech.py \
> --load_cudnn False \
> --alphabet_config_path /home/pablo/deep-speech/transfer_model_tensorflow_it/alphabet.txt \
> --checkpoint_dir /mnt/checkpoints \
> --train_files   /home/pablo/deep-speech/cv-tiny/train.csv \
> --dev_files   /home/pablo/deep-speech/cv-tiny/dev.csv \
> --test_files  /home/pablo/deep-speech/cv-tiny/test.csv \
> --scorer_path /home/pablo/deep-speech/transfer_model_tensorflow_it/scorer \
> --train_batch_size 64 \
> --dev_batch_size 64 \
> --test_batch_size 64 \
> --n_hidden 2048 \
> --epochs 30 \
> --learning_rate 0.0001 \
> --dropout_rate 0.4 \
> --es_epochs 10 \
> --early_stop 1 \
> --drop_source_layers 1 \
> --export_dir /home/pablo/deep-speech/ckpt/ \
> --export_file_name 'ft_model' 

I Loading best validating checkpoint from /mnt/checkpoints/best_dev-754152
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I Initializing variable: layer_6/bias
I Initializing variable: layer_6/bias/Adam
I Initializing variable: layer_6/bias/Adam_1
I Initializing variable: layer_6/weights
I Initializing variable: layer_6/weights/Adam
I Initializing variable: layer_6/weights/Adam_1
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
I Saved new best validating model with loss 852.037170 to: /mnt/checkpoints/best_dev-754152
--------------------------------------------------------------------------------
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 1 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 2 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 2 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 3 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 3 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 4 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 4 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 5 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 5 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 6 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 6 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 7 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 7 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 8 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 8 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 9 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 9 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
--------------------------------------------------------------------------------
Epoch 10 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 10 | Validation | Elapsed Time: 0:00:06 | Steps: 1 | Loss: 852.037170 | Dataset: /home/pablo/deep-speech/cv-tiny/dev.csv
I Early stop triggered as the loss did not improve the last 10 epochs
I FINISHED optimization in 0:01:21.120896
I Loading best validating checkpoint from /mnt/checkpoints/best_dev-754152
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/pablo/deep-speech/cv-tiny/test.csv
Test epoch | Steps: 1 | Elapsed Time: 0:00:09                                                                                                                                                        
Test on /home/pablo/deep-speech/cv-tiny/test.csv - WER: 1.000000, CER: 1.000000, loss: 897.548157
--------------------------------------------------------------------------------
Best WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.246753, loss: 1364.445801
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20001185.wav
 - src: "in seguito kygo e shear hanno proposto di continuare a lavorare sulla canzone"
 - res: "mnmnmnmnmnmnm mnmnmnmnmnmnm novantanove neurodegenerative e unoperazione buonaparte furstenfeldbruck bisbisbisbisbisbis"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.484848, loss: 1267.066528
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_19973815.wav
 - src: "vi furono internati ebrei e profughi slavi provenienti dai balcani"
 - res: "manderebbe nerobianconerobianconerobianconerobianconerobianco neurofibromatosi pulitissimo neolaureata beauxbatons e sbirbimababu"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.951220, loss: 1070.170776
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20045040.wav
 - src: "fin dall'inizio la sede episcopale è stata immediatamente soggetta alla santa sede"
 - res: "non mnmnmnmnmnmnm nerobianconerobianconerobianconerobianconerobianco effettuerebbero bubububù"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.184211, loss: 902.796448
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20059124.wav
 - src: "la parte superiore della facciata comprende una finestra rettangolare murata"
 - res: "mnmnmnmnmnmnm bambinimiracolosidilahiri biancobiancobiancobiancobianco ugualmente aufnahmeausschusssitzung membri"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.060241, loss: 891.478088
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20042813.wav
 - src: "dopo alcuni anni egli decise di tornare in india per raccogliere altri insegnamenti"
 - res: "mnmnmnmnmnmnm dinosauro buongustaio separatamente autocensurerebbero fermerebbe perfettamente bisbisbisbisbisbis"
--------------------------------------------------------------------------------
Median WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.951220, loss: 1070.170776
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20045040.wav
 - src: "fin dall'inizio la sede episcopale è stata immediatamente soggetta alla santa sede"
 - res: "non mnmnmnmnmnmnm nerobianconerobianconerobianconerobianconerobianco effettuerebbero bubububù"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.184211, loss: 902.796448
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20059124.wav
 - src: "la parte superiore della facciata comprende una finestra rettangolare murata"
 - res: "mnmnmnmnmnmnm bambinimiracolosidilahiri biancobiancobiancobiancobianco ugualmente aufnahmeausschusssitzung membri"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.060241, loss: 891.478088
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20042813.wav
 - src: "dopo alcuni anni egli decise di tornare in india per raccogliere altri insegnamenti"
 - res: "mnmnmnmnmnmnm dinosauro buongustaio separatamente autocensurerebbero fermerebbe perfettamente bisbisbisbisbisbis"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.759259, loss: 882.784058
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20033266.wav
 - src: "particolare riguardo è riservato alla produzione da agricoltura biologica sempre più diffusa nella provincia"
 - res: "mandarinadorme bustamontesecondo piupericoloso biofluorescente furtivamente furuhatauna bibibibi"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.860000, loss: 869.030579
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20060953.wav
 - src: "è anche supportata una cifratura utente end to end"
 - res: "mnmnmnmnmnmnm nabucodonosor uòresce fuhrerhauptquartiere un'esagerazione unopinione unafalciatrice bumburubumbububum"
--------------------------------------------------------------------------------
Worst WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.060241, loss: 891.478088
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20042813.wav
 - src: "dopo alcuni anni egli decise di tornare in india per raccogliere altri insegnamenti"
 - res: "mnmnmnmnmnmnm dinosauro buongustaio separatamente autocensurerebbero fermerebbe perfettamente bisbisbisbisbisbis"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.759259, loss: 882.784058
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20033266.wav
 - src: "particolare riguardo è riservato alla produzione da agricoltura biologica sempre più diffusa nella provincia"
 - res: "mandarinadorme bustamontesecondo piupericoloso biofluorescente furtivamente furuhatauna bibibibi"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.860000, loss: 869.030579
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_20060953.wav
 - src: "è anche supportata una cifratura utente end to end"
 - res: "mnmnmnmnmnmnm nabucodonosor uòresce fuhrerhauptquartiere un'esagerazione unopinione unafalciatrice bumburubumbububum"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 2.000000, loss: 462.202454
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_17544185.wav
 - src: "il vuoto assoluto"
 - res: "mnmnmnmnmnmnm incensurato finanzierebbe "
--------------------------------------------------------------------------------
WER: 1.500000, CER: 3.363636, loss: 367.958771
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_19997999.wav
 - src: "alan clarke"
 - res: "mnmnmnmnmnmnm uguaglianza bumburubumbububum"
--------------------------------------------------------------------------------
I Exporting the model...
I Loading best validating checkpoint from /mnt/checkpoints/best_dev-754152
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
I Models exported at /home/pablo/deep-speech/ckpt/
I Model metadata file saved to /home/pablo/deep-speech/ckpt/author_model_0.0.1.md. Before submitting the exported model for publishing make sure all information in the metadata file is correct, and complete the URL fields.

alpha and beta are just for testing, not for training. Use lm_optimizer.py to find the right values after training.
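Something like this (an untested sketch from memory of the v0.9 docs; the paths are yours, and --n_trials / --lm_alpha_max / --lm_beta_max are the search-space flags the script adds on top of the usual ones):

python3 lm_optimizer.py \
--test_files /home/pablo/deep-speech/cv-tiny/test.csv \
--checkpoint_dir /mnt/checkpoints \
--alphabet_config_path /home/pablo/deep-speech/transfer_model_tensorflow_it/alphabet.txt \
--scorer_path /home/pablo/deep-speech/transfer_model_tensorflow_it/scorer \
--n_hidden 2048 \
--n_trials 100 \
--lm_alpha_max 5 \
--lm_beta_max 5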

The scorer is also only for testing, not for training. So if you want to test, then yes, a scorer is helpful.
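For example, an evaluation-only run looks like this (a sketch; in 0.9, DeepSpeech.py skips training and just evaluates when you pass only --test_files):

python3 DeepSpeech.py \
--checkpoint_dir /mnt/checkpoints \
--alphabet_config_path /home/pablo/deep-speech/transfer_model_tensorflow_it/alphabet.txt \
--n_hidden 2048 \
--test_files /home/pablo/deep-speech/cv-tiny/test.csv \
--test_batch_size 64 \
--scorer_path /home/pablo/deep-speech/transfer_model_tensorflow_it/scorer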

At this point I wonder whether you have read the docs? As for whether you still need '--drop_source_layers': it depends.

Doesn’t sound Italian to me …

Give us numbers. What is sufficient? What do you want to do with the model?

First of all, I thank you immensely for your response.

Yes, you are right, maybe I did not pick the perfect example, but you can see others in the last part of the output I shared. Another example is:

WER: 1.000000, CER: 2.000000, loss: 462.202454
 - wav: file:///home/pablo/deep-speech/cv-tiny/common_voice_it_17544185.wav
 - src: "il vuoto assoluto"
 - res: "mnmnmnmnmnmnm incensurato finanzierebbe "

According to what is declared in the DeepSpeech-Italian-Model release, the model was trained on about 257 hours of Italian audio, also using transfer learning from the English model. In fact, if I use that same model to transcribe the phrase "il vuoto assoluto" (as in the previous example), it is transcribed correctly. So I cannot explain why a model that previously transcribed a sentence correctly becomes unable to transcribe it after being trained on a handful of additional test clips (50 or fewer).
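For reference, this is how I checked that clip against the released model (using the deepspeech 0.9 client; the output_graph.pbmm file name inside the model archive is my assumption):

deepspeech \
--model /home/pablo/deep-speech/transfer_model_tensorflow_it/output_graph.pbmm \
--scorer /home/pablo/deep-speech/transfer_model_tensorflow_it/scorer \
--audio /home/pablo/deep-speech/cv-tiny/common_voice_it_17544185.wav
# prints: il vuoto assoluto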

My main doubt is that I made some conceptual errors with the flags when running this command.

Sorry for my English, and thanks again for your attention.

CPU only will be super hyper slow; I doubt you can achieve anything at all with transfer learning.

I have an HP ProLiant DL380 Gen9 server. I thought I could train DeepSpeech with its CPU; the server has no GPU, and I don’t even know which ones it supports. Do you think this is not possible?

It’s not about being possible or not, it will train for sure, but even if you don’t need thousands of hours of audio for fine tuning / transfer learning, you still require a decent amount, and the speed factor between GPU and CPU is several orders of magnitude, so it may take 100 to 1000 times longer on CPU than on GPU.

250 hours of Italian is a really low number, 2000 would be better. If you fine-tune that with just 50 samples and such a high learning rate, it will lead to disaster, as you have seen.

Please read a bit about fine-tuning in deep learning, search for appropriate learning rates here in the forum, and get 50 more hours of material to use for fine-tuning.
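Purely as an illustration of the direction, not validated values (a sketch; ‘cv-larger’ is a placeholder for a dataset of some tens of hours, and the learning rate is just an example of “much lower than 0.0001”):

python3 DeepSpeech.py \
--checkpoint_dir /mnt/checkpoints \
--alphabet_config_path /home/pablo/deep-speech/transfer_model_tensorflow_it/alphabet.txt \
--train_files /home/pablo/deep-speech/cv-larger/train.csv \
--dev_files /home/pablo/deep-speech/cv-larger/dev.csv \
--n_hidden 2048 \
--learning_rate 0.000005 \
--train_batch_size 8 \
--dev_batch_size 8 \
--epochs 5 \
--dropout_rate 0.4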