Question about DeepSpeech Transfer Learning


I am collecting data on Indian English. As I do not have 1000 hours of Indian English data, I cannot build DeepSpeech from scratch. I understand I need to do transfer learning with my 20 hours of Indian English financial-domain data.

When I ran transfer learning earlier, I saw I need to use the version 0.6 alphabet.txt, which does not contain any numbers, uppercase English letters, or special symbols. But my training data contains numbers and many special symbols. Is there any workaround so that I can keep all numbers and special symbols in transfer learning?

Or do I need to convert the number 9 to "nine" in my training CSVs, convert all uppercase to lowercase, and remove all special symbols from my training CSVs?
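(For reference, a minimal sketch of that kind of normalization, assuming the 0.6 English alphabet of lowercase a–z, apostrophe, and space. The digit map here is a toy that spells digits one at a time; a real pipeline would spell whole numbers with a library such as num2words:)

```python
import re

# Toy digit map: spells digits individually ("19" becomes "one nine"),
# so a real pipeline should expand whole numbers instead.
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(transcript: str) -> str:
    """Map a transcript onto lowercase a-z, apostrophe, and space."""
    text = transcript.lower()
    # Spell out each digit, padded with spaces so words do not fuse.
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    # Drop every character outside the release alphabet.
    text = re.sub(r"[^a-z' ]", " ", text)
    # Collapse the whitespace introduced above.
    return re.sub(r"\s+", " ", text).strip()
```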

I’ll let @josh_meyer comment, but that might be a bit low

0.6 please.

No, if you are re-using checkpoints you need to have a compatible alphabet. That’s different.
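(A quick sanity check for that compatibility is to diff the character set of your transcripts against the release alphabet. A sketch; in practice you would read `allowed` from the release alphabet.txt, which lists one character per line with `#` lines as comments:)

```python
def missing_chars(alphabet: set, transcripts) -> set:
    """Return every transcript character the alphabet cannot encode.

    If the result is non-empty, the checkpoints' alphabet and your data
    are incompatible and the offending characters must be normalized away.
    """
    found = set("".join(transcripts))
    return found - alphabet

# The 0.6 English release alphabet: lowercase letters, apostrophe, space.
allowed = set("abcdefghijklmnopqrstuvwxyz' ")
```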

If you are relying on @josh_meyer’s transfer-learning2 not-yet-merged branch, you are free to do whatever you want.

That’s complicated: in some languages, numbers can be said several ways. If your dataset is not consistent in how people say numbers, you will have issues.

That depends on how you perform your transfer learning …

Thanks for the reply. You said 20 hours is a bit low, so what is the minimum number of hours of voice data required for Indian English to do transfer learning on top of the 0.6 baseline pre-trained model?

Earlier I used the following command, so I cannot modify the alphabet.txt. Please confirm.

python -u DeepSpeech.py --noshow_progressbar \
  --train_files /home/ubuntu/mic_deepspeech/data/voicedata2/bob.csv \
  --test_files /home/ubuntu/mic_deepspeech/data/voicedata2/bob.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 50 \
  --epochs 10 \
  --export_dir /home/ubuntu/mic_deepspeech/deepspeech-0.6.0-models/transferlearning \
  --alphabet_config_path /home/ubuntu/mic_deepspeech/data/alphabet.txt \
  --lm_binary_path /home/ubuntu/mic_deepspeech/deepspeech-0.6.0-models/lm.binary \
  --lm_trie_path /home/ubuntu/mic_deepspeech/deepspeech-0.6.0-models/trie \
  --cudnn_checkpoint /home/ubuntu/mic_deepspeech/deepspeech-0.6.0-models/checkpoints/deepspeech-0.6.0-checkpoint/

What is the link to @josh_meyer’s transfer-learning2 not-yet-merged branch? I will try that code to do transfer learning.
I assume I do not need to create any lm.binary and trie for transfer learning if I use the transfer-learning2 branch. Is there any way I can modify the pre-trained model’s lm.binary and trie? To do this, I need the vocabulary.txt of the 0.6 model. Is vocabulary.txt available for the 0.6 model?

You really need to start reading the documentation.


What is the link to @josh_meyer’s transfer-learning2 not-yet-merged branch? I will try that code to do transfer learning.

Have you had a look at our GitHub? You will see pending pull requests there …

I assume you are pointing to this.

You said 20 hours is a bit low, so what is the minimum number of hours of voice data required for Indian English to do transfer learning on top of the 0.6 baseline pre-trained model?

Try it. The more data you have the better.

See, you can find it by yourself :slight_smile:

As I have been telling you on repeat, each and every time you asked that kind of question in private: it highly depends on what you have, its quality, and your goals.


It is really hard to give you an exact number of hours that you’ll need for training. I would argue that around 100 hours of good input material would be a good start. Try training with 50 vs. 100 hours and you should see whether this works for you. If you can’t get that much, try 10 vs. 20 and have good testing data. Maybe use benchmarkstt for that.
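(For the testing part: WER is just Levenshtein distance over word tokens divided by the reference length. A self-contained sketch, if you want a quick check before reaching for benchmarkstt:)

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)
```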

As for the numbers, try num2words and check the outputs. Also look out for currencies and times.
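(A toy version of what num2words does, for 0–99 only; the library itself also covers large numbers, ordinals, currencies, and many locales:)

```python
import re

# Number words for a minimal 0-99 expander.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell(n: int) -> str:
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand_numbers(text: str) -> str:
    """Replace standalone one- or two-digit numbers with words."""
    return re.sub(r"\b\d{1,2}\b", lambda m: spell(int(m.group())), text)
```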

Hi @othiele, thanks for your quick reply.

I have selected some keyword intents, say fixed deposit and recurring deposit.
I have downloaded 2 hours of video for each intent, so I have 4 hours of voice in total. Then I did pitch, speed, and noise augmentation, which expanded the voice data to 16 hours.
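(A minimal numpy sketch of the noise and speed augmentations described; real pipelines typically use sox or librosa, which also handle anti-aliasing and proper pitch shifting:)

```python
import numpy as np

def augment(wav, noise_db: float = -30.0, speed: float = 1.1):
    """Return a noise-augmented and a speed-augmented copy of a waveform.

    `wav` is a float array in [-1, 1]; `noise_db` sets the noise level
    relative to full scale; `speed` > 1 shortens the clip.
    """
    # Additive white noise at a fixed level below full scale.
    noise = np.random.randn(len(wav)) * (10 ** (noise_db / 20))
    noisy = np.clip(wav + noise, -1.0, 1.0)
    # Naive speed change by linear resampling (no anti-aliasing).
    n_out = int(len(wav) / speed)
    sped = np.interp(np.linspace(0, len(wav) - 1, n_out),
                     np.arange(len(wav)), wav)
    return noisy, sped
```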

Is that not good or sufficient for DeepSpeech training? I am not building an ASR for a general domain, and not even for finance in general; I am going intent by intent. Is that not the right approach?

Or, if you mean 20 hours is not sufficient, does that mean I need to go for more generic Indian English data rather than being intent-specific (like fixed deposit or recurring deposit)?

If these 2 intents work out, I will keep adding voice for more financial intents.

Is that a good, viable approach?

You should just do it and see. As repeated, this is highly dependent on your data and your goals. Right now, you have spent much more time asking and asking than it would have taken you to run a few iterations and get actual, valuable feedback from your dataset and goals.


And just 4 hours of truly genuine material is not much, get 10 and blow it up :slight_smile: That might get you somewhere. And lissyx is of course right, that it depends greatly on the task. If we had exact numbers we would tell you, but this is not an exact science …


Hello @lissyx and @othiele,
I have downloaded all the dependencies and successfully fine-tuned with 34 hours of Indian-accent audio extracted from YouTube. There are 16,500 audio files in total; I trained on 13k / 2k / 1.5k (train, dev, test). But I didn’t get good accuracy: my WER is 0.29.

python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir checkpoints/vlsi2 \
  --epochs 40 \
  --train_files …/deepspeech-0.6.1-models/audio/vlsi/train.csv \
  --dev_files …/deepspeech-0.6.1-models/audio/vlsi/validate.csv \
  --test_files …/deepspeech-0.6.1-models/audio/vlsi/test.csv \
  --learning_rate 0.000001 \
  --use_cudnn_rnn true \
  --use_allow_growth true \
  --lm_binary_path …/deepspeech-0.6.1-models/lm.binary \
  --lm_trie_path …/deepspeech-0.6.1-models/trie \
  --noearly_stop \
  --export_dir exported_model/vlsi2 \
  --train_batch_size 128 --dev_batch_size 128 --test_batch_size 128

The model gives a WER of 0.29 on the test set, and it even predicts some words incorrectly compared to the pre-trained model (without fine-tuning).

Am I doing something wrong?
If you need any other info, please let me know.

EDIT: I am using DeepSpeech 0.6.1

Please use current master for transfer learning.

That’s too vague to be actionable. When doing transfer learning you expect your model to “un-learn” and then “learn”, but obviously this could be a side effect.

What was the WER before?

Hello @lissyx

What was the WER before?

Before fine-tuning, the results were quite bad, as the audio is Indian-accented. After fine-tuning, the results improved a little compared to the pre-trained model.
But for some sentences I tested before and after fine-tuning (not training data; I am talking about some random data), the pre-trained model was good, whereas after fine-tuning the results aren’t good.

For example:
Sentence 1: Any random sentence
Sentence 2: Test Data from Indian Accent Corpus

Using Pre-Trained Model:
Sentence 1 has good results but not Sentence 2.

After Fine-Tuning on the same corpus:
Sentence 2 has good results but not Sentence 1.

In my opinion, after fine-tuning, both Sentence 1 and Sentence 2 should be good, or at least the same as before.

But that would require a source model trained on master, right?

I think the checkpoints are still compatible
