Question with DeepSpeech Transfer Learning

Azeem_Husain · March 5, 2020, 5:32am

Hello @lissyx and @othiele,
I have installed all the dependencies and successfully fine-tuned on 34 hours of Indian-accent audio extracted from YouTube. There are 16,500 audio files in total, split 13k / 2k / 1.5k (train / dev / test). But the accuracy is not good: my WER is 0.29.

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir checkpoints/vlsi2 --epochs 40 --train_files …/deepspeech-0.6.1-models/audio/vlsi/train.csv --dev_files …/deepspeech-0.6.1-models/audio/vlsi/validate.csv --test_files …/deepspeech-0.6.1-models/audio/vlsi/test.csv --learning_rate 0.000001 --use_cudnn_rnn true --use_allow_growth true --lm_binary_path …/deepspeech-0.6.1-models/lm.binary --lm_trie_path …/deepspeech-0.6.1-models/trie --noearly_stop --export_dir exported_model/vlsi2 --train_batch_size 128 --dev_batch_size 128 --test_batch_size 128

The model gives a WER of 0.29 on the test set, and it even gets some words wrong that the pre-trained model (without tuning) gets right.

Am I doing something wrong?
If you need any other info, please let me know.

EDIT: I am using DeepSpeech 0.6.1

lissyx · March 5, 2020, 8:37am

Please use current master for transfer learning.

That’s too vague to be actionable. When doing transfer learning you expect your model to “un-learn” and then “learn”, but obviously this could be a side effect.

What was the WER before?

Azeem_Husain · March 5, 2020, 8:54am

Hello @lissyx

lissyx wrote: “What was WER before?”

Before fine-tuning, the results on this data were very poor because of the Indian accent. After fine-tuning, they improved a little compared to the pre-trained model.
But some sentences that I tested both before and after fine-tuning (random data, not the training data) were transcribed well by the pre-trained model and poorly by the fine-tuned one.

For example:
Sentence 1: Any random sentence
Sentence 2: Test Data from Indian Accent Corpus

Using Pre-Trained Model:
Sentence 1 has good results but not Sentence 2.

After Fine-Tuning on the same corpus:
Sentence 2 has good results but not Sentence 1.

In my view, after fine-tuning, both Sentence 1 and Sentence 2 should be good, or at least no worse than before.

Jendker · March 5, 2020, 9:03am

But that would require the source model trained on master, right?

lissyx · March 5, 2020, 9:04am

I think the checkpoints are still compatible

lissyx · March 5, 2020, 9:06am

“Too bad” and “a little” are not really helpful. Please give proper figures; individual examples can mislead you.

In my view, you need to share more context on what you have tested. Fine-tuning requires work: maybe your learning rate is too high, maybe you need to tune dropout, maybe you need to train for more or fewer epochs. Have you looked at how the train / dev loss evolves?
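
One way to look at the train / dev loss evolution is sketched below. It assumes the per-epoch losses have been copied by hand from the training output into two lists; the function name `summarize_losses`, the `patience` parameter, and the example numbers are all hypothetical and not part of DeepSpeech.

```python
def summarize_losses(train_losses, dev_losses, patience=3):
    """Return (best_epoch, diverging) for per-epoch loss lists.

    best_epoch: 0-based index of the epoch with the lowest dev loss.
    diverging:  True if, over the last `patience` epochs, the dev loss
                rose every epoch while the train loss kept falling,
                which is the usual sign of overfitting.
    """
    if not dev_losses or len(train_losses) != len(dev_losses):
        raise ValueError("need two non-empty lists of equal length")

    best_epoch = min(range(len(dev_losses)), key=dev_losses.__getitem__)

    n = len(dev_losses)
    diverging = n > patience and all(
        dev_losses[i] > dev_losses[i - 1] and train_losses[i] < train_losses[i - 1]
        for i in range(n - patience, n)
    )
    return best_epoch, diverging


# Hypothetical numbers, for illustration only (not from this thread).
train = [50.0, 40.0, 30.0, 25.0, 20.0, 15.0]
dev = [55.0, 45.0, 40.0, 42.0, 44.0, 47.0]
best, diverging = summarize_losses(train, dev)
print(f"lowest dev loss at epoch {best}, diverging: {diverging}")
```

The commands posted in this thread pass --noearly_stop, so a check like this has to be done by hand; training will not stop on its own when the dev loss starts to rise.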

othiele · March 5, 2020, 9:41am

30-plus hours from YouTube might not be the best source for learning. How did you get the transcripts, and what is their quality?

And I don’t find .29 bad for 30 hours from Youtube.

And you probably have catastrophic forgetting due to bad data.

Azeem_Husain · March 5, 2020, 9:57am

I ran evaluate.py using the checkpoint exactly as downloaded from the release (no tuning).

Results:

Test on …/deepspeech-0.6.1-models/audio/vlsi/test.csv - WER: 0.525011, CER: 0.311230, loss: 111.989403

After Fine-Tuning on audio files extracted from YouTube.
https://www.youtube.com/playlist?list=PLCmoXVuSEVHlEJi3SwdyJ4EICffuyqpjk

I downloaded the above playlist plus one more, then divided the audio into chunks, which gave 16,500 audio samples (13k / 2k / 1.5k for train / dev / test).

python3 DeepSpeech.py \
--n_hidden 2048 \
--checkpoint_dir checkpoints/deepspeech-0.6.1-checkpoint/ \
--epochs 50 \
--train_files ../deepspeech-0.6.1-models/audio/vlsi/train.csv \
--dev_files ../deepspeech-0.6.1-models/audio/vlsi/validate.csv \
--test_files ../deepspeech-0.6.1-models/audio/vlsi/test.csv \
--learning_rate 0.00001 \
--use_cudnn_rnn true \
--use_allow_growth true \
--lm_binary_path ../deepspeech-0.6.1-models/lm.binary \
--lm_trie_path ../deepspeech-0.6.1-models/trie \
--noearly_stop \
--dropout_rate 0.15 \
--export_dir exported_model/vlsi2 \
--train_batch_size 64 \
--dev_batch_size 64 \
--test_batch_size 64

Results:

Test on …/deepspeech-0.6.1-models/audio/vlsi/test.csv - WER: 0.203588, CER: 0.105375, loss: 38.082546

Using Random Audio File:
Pre-Trained Results:

Test on …/deepspeech-0.6.1-models/audio/fluent_speech/csv/test.csv - WER: 0.263332, CER: 0.125334, loss: 12.215734

Fine-Tuned Results:

Test on …/deepspeech-0.6.1-models/audio/fluent_speech/csv/test.csv - WER: 0.478590, CER: 0.276625, loss: 24.773346

If you need any other info, please let me know.

Azeem_Husain · March 5, 2020, 10:01am

Hi @othiele
I have shared the link to the YouTube playlist. I used the transcripts provided by the channel and did some pre-processing.

othiele wrote: “And I don’t find .29 bad for 30 hours from Youtube.”

Yeah, I also did some hyperparameter tuning and got the WER down to 0.20 (please see my last comment).
But the problem is that it disturbed the previous weights.

lissyx · March 5, 2020, 10:02am

So, on your YouTube-based test-set, you have 52.5% WER and 31.1% CER before fine-tuning, and 20.4% WER / 10.5% CER after fine-tuning with ~30h of data ?

That indeed does look like a very nice improvement.

Well, that’s going to be the issue you have to work on.

I suspect you want more than just those validation and test set if you want to avoid degrading quality on previous data. Otherwise, it makes sense that the new learning optimizes for the new data.
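
To put the two effects side by side, here is a small worked calculation of the relative WER change on each test set, using only the figures posted above. The helper name `relative_change` and the dictionary labels are just for illustration.

```python
def relative_change(before, after):
    """Relative change from `before` to `after` (negative means WER went down)."""
    return (after - before) / before


# WER figures as posted in this thread: (pre-trained, fine-tuned).
results = {
    "vlsi (YouTube) test set": (0.525011, 0.203588),
    "fluent_speech test set": (0.263332, 0.478590),
}

for name, (before, after) in results.items():
    change = relative_change(before, after)
    print(f"{name}: WER {before:.3f} -> {after:.3f} ({change:+.1%})")
```

This works out to roughly a 61% relative WER reduction on the new data and roughly an 82% relative WER increase on the other test set, which is the trade-off described above.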

othiele · March 5, 2020, 10:12am

What did you change for fine-tuning? You might be “overfitting” to the new data now.

And how did you cut the videos into chunks? The videos look OK for training, as the speaker talks slowly and there isn’t much background noise.

Azeem_Husain · March 5, 2020, 10:18am

I cut the audio according to each video’s SRT file using Pydub, converted it to mono with a 16k frame_rate, and exported the chunks as WAV files.

I described every step of my fine-tuning in the previous comments (please refer to them).
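
For readers who want to reproduce that kind of pipeline, here is a minimal sketch of cutting by SRT. It is not the poster's actual script: the function names, the `pad_ms` parameter, and the output file naming are assumptions. The SRT parsing uses only the standard library; the cutting step uses pydub's `AudioSegment` (third-party, and it needs ffmpeg for non-WAV input).

```python
import os
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(stamp):
    """Convert an SRT timestamp such as '00:01:02,500' to milliseconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_srt(text):
    """Return a list of (start_ms, end_ms, transcript) cues from SRT text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue  # not a cue block
        start, end = lines[timing].split("-->")
        # Anything after the end timestamp (e.g. position hints) is ignored.
        cues.append((to_ms(start), to_ms(end.split()[0]),
                     " ".join(lines[timing + 1:])))
    return cues


def cut_chunks(audio_path, cues, out_dir, pad_ms=0):
    """Write one mono 16 kHz WAV per cue and return the output paths."""
    from pydub import AudioSegment  # imported here so parsing works without pydub

    audio = AudioSegment.from_file(audio_path).set_channels(1).set_frame_rate(16000)
    paths = []
    for index, (start, end, _) in enumerate(cues):
        out_path = os.path.join(out_dir, f"chunk_{index:05d}.wav")
        audio[max(0, start - pad_ms):end + pad_ms].export(out_path, format="wav")
        paths.append(out_path)
    return paths
```

Subtitle timings are written for readability rather than alignment, so a cue can start late or end mid-word; the hypothetical `pad_ms` is one crude way to experiment with that.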

othiele · March 5, 2020, 10:23am

Check the created chunks; cutting by SRT alone might give you bad results.

Sorry, I thought you had also changed the LM values; don’t for now. The problem is more likely your data.
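
One cheap way to check the chunks in bulk is to compare each clip's duration with the length of its transcript and listen to the outliers. The sketch below is only an assumption about what such a check could look like: the function names and the characters-per-second thresholds are hypothetical and would need tuning on real data.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds, using only the standard library."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def flag_chunks(chunks, min_cps=4.0, max_cps=25.0):
    """Flag chunks whose transcript length does not fit their duration.

    chunks: iterable of (name, duration_seconds, transcript).
    Returns a list of (name, reason) for chunks worth listening to.
    The characters-per-second bounds are guesses, not established values.
    """
    flagged = []
    for name, duration, transcript in chunks:
        if duration <= 0:
            flagged.append((name, "empty audio"))
            continue
        chars_per_second = len(transcript.strip()) / duration
        if not min_cps <= chars_per_second <= max_cps:
            flagged.append((name, f"{chars_per_second:.1f} chars/s"))
    return flagged


# Hypothetical rows, for illustration only.
example = [
    ("a.wav", 2.0, "hello world how are you"),
    ("b.wav", 0.5, "this transcript is far too long for half a second"),
    ("c.wav", 10.0, "hi"),
]
for name, reason in flag_chunks(example):
    print(name, reason)
```

A clip with far too much text for its length usually means the cut ended early, and a long clip with almost no text usually means untranscribed speech or silence was included.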

Azeem_Husain · March 5, 2020, 10:30am

Ok, Thanks for the suggestion, I will try to get something better than this.

No, I haven’t changed anything in LM or Trie. I am using the same LM and Trie from the pre-trained model

Azeem_Husain · March 5, 2020, 10:39am

Any suggestions from your side? It would be great.

lissyx wrote: “I suspect you want more than just those validation and test set if you want to avoid degrading quality on previous data.”

Sorry, I didn’t get this point. Can you please elaborate?

lissyx · March 5, 2020, 10:46am

I don’t have your dataset, I can’t do your work there. I’ve already shared suggestions.

I don’t see how to say it otherwise: you are fine-tuning and using only one validation set, so your network is getting optimized for this one. That’s also why it regressed on the previous dataset.

othiele · March 5, 2020, 10:51am

I guess, stating that in plain English, you could go for an even lower learning rate or put more plain English examples into the validation set, so as to alter the original weights a little less.
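
A rough sketch of the second option: mix a sample of general-English rows into a fine-tuning CSV. It assumes both files use the DeepSpeech CSV columns wav_filename, wav_filesize, transcript. The function name `mix_csvs`, the `original_fraction` parameter, and any file names are hypothetical, and the thread does not say which general-English data would be used.

```python
import csv
import random

FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def mix_csvs(new_csv, original_csv, out_csv, original_fraction=0.3, seed=0):
    """Write out_csv with all rows of new_csv plus a random sample of
    original_csv, sized so that original rows make up about
    `original_fraction` of the result. Returns (n_new, n_original)."""
    if not 0 <= original_fraction < 1:
        raise ValueError("original_fraction must be in [0, 1)")

    with open(new_csv, newline="", encoding="utf-8") as handle:
        new_rows = list(csv.DictReader(handle))
    with open(original_csv, newline="", encoding="utf-8") as handle:
        original_rows = list(csv.DictReader(handle))

    # Solve n_original / (n_new + n_original) = original_fraction.
    wanted = round(len(new_rows) * original_fraction / (1 - original_fraction))
    n_original = min(wanted, len(original_rows))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mixed = new_rows + rng.sample(original_rows, n_original)
    rng.shuffle(mixed)

    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(mixed)
    return len(new_rows), n_original
```

The same mixing could be applied to the dev CSV, so that the validation loss reflects both the new accent data and the original kind of speech.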

reuben · March 5, 2020, 10:56am

One thing about transfer learning on master: the checkpoints from v0.6.1 are not compatible with master, due to a bugfix in the MFCC computation code. They will still load, but they will give bad results. Make sure you don’t mix those up.

Azeem_Husain · March 5, 2020, 11:00am

Ohh, now I got your points.
Thanks a lot.

lissyx · March 5, 2020, 11:01am

True, I forgot about that. Are you referring to the upper limit of frequency? Maybe in this case it’s hackable by removing it. Being able to use up-to-date transfer learning from the 0.6.1 model would likely bring more good than harm?