Training results with 0.4.1 far worse than 0.3.0

I used the following command to train with 0.3.0:
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ~/checkpoint_retrain --epoch -10 --train_files /home/rsandhu/iitm_data/assamese-male-train.csv,/home/rsandhu/iitm_data/assamese-female-train.csv,/home/rsandhu/iitm_data/bengali-male-train.csv,/home/rsandhu/iitm_data/gujarati-male-train.csv,/home/rsandhu/iitm_data/gujarati-female-train.csv,/home/rsandhu/iitm_data/hindi-male-train.csv,/home/rsandhu/iitm_data/hindi-female-train.csv,/home/rsandhu/iitm_data/kannada-male-train.csv,/home/rsandhu/iitm_data/kannada-female-train.csv,/home/rsandhu/iitm_data/malayalam-male-train.csv,/home/rsandhu/iitm_data/malayalam-female-train.csv,/home/rsandhu/iitm_data/manipuri-male-train.csv,/home/rsandhu/iitm_data/manipuri-female-train.csv,/home/rsandhu/iitm_data/rajasthani-male-train.csv,/home/rsandhu/iitm_data/rajasthani-female-train.csv,/home/rsandhu/iitm_data/tamil-male-train.csv,/home/rsandhu/iitm_data/tamil-female-train.csv --dev_files /home/rsandhu/iitm_data/assamese-male-dev.csv,/home/rsandhu/iitm_data/assamese-female-dev.csv,/home/rsandhu/iitm_data/bengali-male-dev.csv,/home/rsandhu/iitm_data/gujarati-male-dev.csv,/home/rsandhu/iitm_data/gujarati-female-dev.csv,/home/rsandhu/iitm_data/hindi-male-dev.csv,/home/rsandhu/iitm_data/hindi-female-dev.csv,/home/rsandhu/iitm_data/kannada-male-dev.csv,/home/rsandhu/iitm_data/kannada-female-dev.csv,/home/rsandhu/iitm_data/malayalam-male-dev.csv,/home/rsandhu/iitm_data/malayalam-female-dev.csv,/home/rsandhu/iitm_data/manipuri-male-dev.csv,/home/rsandhu/iitm_data/manipuri-female-dev.csv,/home/rsandhu/iitm_data/rajasthani-male-dev.csv,/home/rsandhu/iitm_data/rajasthani-female-dev.csv,/home/rsandhu/iitm_data/tamil-male-dev.csv,/home/rsandhu/iitm_data/tamil-female-dev.csv --test_files /home/rsandhu/iitm_data/assamese-male-test.csv,/home/rsandhu/iitm_data/assamese-female-test.csv,/home/rsandhu/iitm_data/bengali-male-test.csv,/home/rsandhu/iitm_data/gujarati-male-test.csv,/home/rsandhu/iitm_data/gujarati-female-test.csv,/home/rsandhu/iitm_data/hindi-male-test.csv,/home/rsandhu/iitm_data/hindi-female-test.csv,/home/rsandhu/iitm_data/kannada-male-test.csv,/home/rsandhu/iitm_data/kannada-female-test.csv,/home/rsandhu/iitm_data/malayalam-male-test.csv,/home/rsandhu/iitm_data/malayalam-female-test.csv,/home/rsandhu/iitm_data/manipuri-male-test.csv,/home/rsandhu/iitm_data/manipuri-female-test.csv,/home/rsandhu/iitm_data/rajasthani-male-test.csv,/home/rsandhu/iitm_data/rajasthani-female-test.csv,/home/rsandhu/iitm_data/tamil-female-test.csv,/home/rsandhu/iitm_data/tamil-male-test.csv --learning_rate 0.0001 --train_batch_size 24 --dev_batch_size 48 --test_batch_size 48 --display_step 0 --validation_step 1 --dropout_rate 0.2 --checkpoint_step 1 --decoder_library_path binaries/libctc_decoder_with_kenlm.so --export_dir ~/new_model

Here are the results of the test epoch:

I used the following command for training with 0.4.1 with the same data:
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ~/deepspeech-0.4.1-checkpoint --epoch -10 --train_files /home/rsandhu/iitm_data/assamese-male-train.csv,/home/rsandhu/iitm_data/assamese-female-train.csv,/home/rsandhu/iitm_data/bengali-male-train.csv,/home/rsandhu/iitm_data/gujarati-male-train.csv,/home/rsandhu/iitm_data/gujarati-female-train.csv,/home/rsandhu/iitm_data/hindi-male-train.csv,/home/rsandhu/iitm_data/hindi-female-train.csv,/home/rsandhu/iitm_data/kannada-male-train.csv,/home/rsandhu/iitm_data/kannada-female-train.csv,/home/rsandhu/iitm_data/malayalam-male-train.csv,/home/rsandhu/iitm_data/malayalam-female-train.csv,/home/rsandhu/iitm_data/manipuri-male-train.csv,/home/rsandhu/iitm_data/manipuri-female-train.csv,/home/rsandhu/iitm_data/rajasthani-male-train.csv,/home/rsandhu/iitm_data/rajasthani-female-train.csv,/home/rsandhu/iitm_data/tamil-male-train.csv,/home/rsandhu/iitm_data/tamil-female-train.csv --dev_files /home/rsandhu/iitm_data/assamese-male-dev.csv,/home/rsandhu/iitm_data/assamese-female-dev.csv,/home/rsandhu/iitm_data/bengali-male-dev.csv,/home/rsandhu/iitm_data/gujarati-male-dev.csv,/home/rsandhu/iitm_data/gujarati-female-dev.csv,/home/rsandhu/iitm_data/hindi-male-dev.csv,/home/rsandhu/iitm_data/hindi-female-dev.csv,/home/rsandhu/iitm_data/kannada-male-dev.csv,/home/rsandhu/iitm_data/kannada-female-dev.csv,/home/rsandhu/iitm_data/malayalam-male-dev.csv,/home/rsandhu/iitm_data/malayalam-female-dev.csv,/home/rsandhu/iitm_data/manipuri-male-dev.csv,/home/rsandhu/iitm_data/manipuri-female-dev.csv,/home/rsandhu/iitm_data/rajasthani-male-dev.csv,/home/rsandhu/iitm_data/rajasthani-female-dev.csv,/home/rsandhu/iitm_data/tamil-male-dev.csv,/home/rsandhu/iitm_data/tamil-female-dev.csv --test_files /home/rsandhu/iitm_data/assamese-male-test.csv,/home/rsandhu/iitm_data/assamese-female-test.csv,/home/rsandhu/iitm_data/bengali-male-test.csv,/home/rsandhu/iitm_data/gujarati-male-test.csv,/home/rsandhu/iitm_data/gujarati-female-test.csv,/home/rsandhu/iitm_data/hindi-male-test.csv,/home/rsandhu/iitm_data/hindi-female-test.csv,/home/rsandhu/iitm_data/kannada-male-test.csv,/home/rsandhu/iitm_data/kannada-female-test.csv,/home/rsandhu/iitm_data/malayalam-male-test.csv,/home/rsandhu/iitm_data/malayalam-female-test.csv,/home/rsandhu/iitm_data/manipuri-male-test.csv,/home/rsandhu/iitm_data/manipuri-female-test.csv,/home/rsandhu/iitm_data/rajasthani-male-test.csv,/home/rsandhu/iitm_data/rajasthani-female-test.csv,/home/rsandhu/iitm_data/tamil-female-test.csv,/home/rsandhu/iitm_data/tamil-male-test.csv --learning_rate 0.0001 --train_batch_size 24 --dev_batch_size 48 --test_batch_size 48 --display_step 0 --validation_step 1 --dropout_rate 0.2 --checkpoint_step 1 --lm_alpha 0.75 --lm_beta 1.85 --export_dir ~/new_model

And the results are as below:

I haven’t been able to observe the validation loss for each of the epochs because the warning "WARNING:root:frame length (1536) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid." occurs several hundred times whenever a CSV is being preprocessed.

@rajpuneet.sandhu Please post proper, usable debug information, using code formatting for console output, and no screenshots.

@lissyx, I was using a VM through PuTTY and took the screenshots. I closed it yesterday and no longer have that information. Can we please make it work this one time?

No. And that should not stop you from fixing the console output.

There’s far too little information in your post to actually understand what your problem is. You have a WER of 0.14 with 0.4.1 and a WER of 0.17 with 0.3.0; that’s an improvement, not a regression.

0.3.0:
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ~/checkpoint_retrain --epoch -10 --train_files /home/rsandhu/iitm_data/assamese-male-train.csv,/home/rsandhu/iitm_data/assamese-female-train.csv,/home/rsandhu/iitm_data/bengali-male-train.csv,/home/rsandhu/iitm_data/gujarati-male-train.csv,/home/rsandhu/iitm_data/gujarati-female-train.csv,/home/rsandhu/iitm_data/hindi-male-train.csv,/home/rsandhu/iitm_data/hindi-female-train.csv,/home/rsandhu/iitm_data/kannada-male-train.csv,/home/rsandhu/iitm_data/kannada-female-train.csv,/home/rsandhu/iitm_data/malayalam-male-train.csv,/home/rsandhu/iitm_data/malayalam-female-train.csv,/home/rsandhu/iitm_data/manipuri-male-train.csv,/home/rsandhu/iitm_data/manipuri-female-train.csv,/home/rsandhu/iitm_data/rajasthani-male-train.csv,/home/rsandhu/iitm_data/rajasthani-female-train.csv,/home/rsandhu/iitm_data/tamil-male-train.csv,/home/rsandhu/iitm_data/tamil-female-train.csv --dev_files /home/rsandhu/iitm_data/assamese-male-dev.csv,/home/rsandhu/iitm_data/assamese-female-dev.csv,/home/rsandhu/iitm_data/bengali-male-dev.csv,/home/rsandhu/iitm_data/gujarati-male-dev.csv,/home/rsandhu/iitm_data/gujarati-female-dev.csv,/home/rsandhu/iitm_data/hindi-male-dev.csv,/home/rsandhu/iitm_data/hindi-female-dev.csv,/home/rsandhu/iitm_data/kannada-male-dev.csv,/home/rsandhu/iitm_data/kannada-female-dev.csv,/home/rsandhu/iitm_data/malayalam-male-dev.csv,/home/rsandhu/iitm_data/malayalam-female-dev.csv,/home/rsandhu/iitm_data/manipuri-male-dev.csv,/home/rsandhu/iitm_data/manipuri-female-dev.csv,/home/rsandhu/iitm_data/rajasthani-male-dev.csv,/home/rsandhu/iitm_data/rajasthani-female-dev.csv,/home/rsandhu/iitm_data/tamil-male-dev.csv,/home/rsandhu/iitm_data/tamil-female-dev.csv --test_files /home/rsandhu/iitm_data/assamese-male-test.csv,/home/rsandhu/iitm_data/assamese-female-test.csv,/home/rsandhu/iitm_data/bengali-male-test.csv,/home/rsandhu/iitm_data/gujarati-male-test.csv,/home/rsandhu/iitm_data/gujarati-female-test.csv,/home/rsandhu/iitm_data/hindi-male-test.csv,/home/rsandhu/iitm_data/hindi-female-test.csv,/home/rsandhu/iitm_data/kannada-male-test.csv,/home/rsandhu/iitm_data/kannada-female-test.csv,/home/rsandhu/iitm_data/malayalam-male-test.csv,/home/rsandhu/iitm_data/malayalam-female-test.csv,/home/rsandhu/iitm_data/manipuri-male-test.csv,/home/rsandhu/iitm_data/manipuri-female-test.csv,/home/rsandhu/iitm_data/rajasthani-male-test.csv,/home/rsandhu/iitm_data/rajasthani-female-test.csv,/home/rsandhu/iitm_data/tamil-female-test.csv,/home/rsandhu/iitm_data/tamil-male-test.csv --learning_rate 0.0001 --train_batch_size 24 --dev_batch_size 48 --test_batch_size 48 --display_step 0 --validation_step 1 --dropout_rate 0.2 --checkpoint_step 1 --decoder_library_path binaries/libctc_decoder_with_kenlm.so --export_dir ~/new_model
0.4.1:
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ~/deepspeech-0.4.1-checkpoint --epoch -5 --train_files /home/rsandhu/iitm_data/assamese-male-train.csv,/home/rsandhu/iitm_data/assamese-female-train.csv,/home/rsandhu/iitm_data/bengali-male-train.csv,/home/rsandhu/iitm_data/gujarati-male-train.csv,/home/rsandhu/iitm_data/gujarati-female-train.csv,/home/rsandhu/iitm_data/hindi-male-train.csv,/home/rsandhu/iitm_data/hindi-female-train.csv,/home/rsandhu/iitm_data/kannada-male-train.csv,/home/rsandhu/iitm_data/kannada-female-train.csv,/home/rsandhu/iitm_data/malayalam-male-train.csv,/home/rsandhu/iitm_data/malayalam-female-train.csv,/home/rsandhu/iitm_data/manipuri-male-train.csv,/home/rsandhu/iitm_data/manipuri-female-train.csv,/home/rsandhu/iitm_data/rajasthani-male-train.csv,/home/rsandhu/iitm_data/rajasthani-female-train.csv,/home/rsandhu/iitm_data/tamil-male-train.csv,/home/rsandhu/iitm_data/tamil-female-train.csv --dev_files /home/rsandhu/iitm_data/assamese-male-dev.csv,/home/rsandhu/iitm_data/assamese-female-dev.csv,/home/rsandhu/iitm_data/bengali-male-dev.csv,/home/rsandhu/iitm_data/gujarati-male-dev.csv,/home/rsandhu/iitm_data/gujarati-female-dev.csv,/home/rsandhu/iitm_data/hindi-male-dev.csv,/home/rsandhu/iitm_data/hindi-female-dev.csv,/home/rsandhu/iitm_data/kannada-male-dev.csv,/home/rsandhu/iitm_data/kannada-female-dev.csv,/home/rsandhu/iitm_data/malayalam-male-dev.csv,/home/rsandhu/iitm_data/malayalam-female-dev.csv,/home/rsandhu/iitm_data/manipuri-male-dev.csv,/home/rsandhu/iitm_data/manipuri-female-dev.csv,/home/rsandhu/iitm_data/rajasthani-male-dev.csv,/home/rsandhu/iitm_data/rajasthani-female-dev.csv,/home/rsandhu/iitm_data/tamil-male-dev.csv,/home/rsandhu/iitm_data/tamil-female-dev.csv --test_files /home/rsandhu/iitm_data/assamese-male-test.csv,/home/rsandhu/iitm_data/assamese-female-test.csv,/home/rsandhu/iitm_data/bengali-male-test.csv,/home/rsandhu/iitm_data/gujarati-male-test.csv,/home/rsandhu/iitm_data/gujarati-female-test.csv,/home/rsandhu/iitm_data/hindi-male-test.csv,/home/rsandhu/iitm_data/hindi-female-test.csv,/home/rsandhu/iitm_data/kannada-male-test.csv,/home/rsandhu/iitm_data/kannada-female-test.csv,/home/rsandhu/iitm_data/malayalam-male-test.csv,/home/rsandhu/iitm_data/malayalam-female-test.csv,/home/rsandhu/iitm_data/manipuri-male-test.csv,/home/rsandhu/iitm_data/manipuri-female-test.csv,/home/rsandhu/iitm_data/rajasthani-male-test.csv,/home/rsandhu/iitm_data/rajasthani-female-test.csv,/home/rsandhu/iitm_data/tamil-female-test.csv,/home/rsandhu/iitm_data/tamil-male-test.csv --learning_rate 0.0001 --train_batch_size 24 --dev_batch_size 48 --test_batch_size 48 --display_step 0 --validation_step 1 --dropout_rate 0.2 --checkpoint_step 1 --lm_alpha 0.75 --lm_beta 1.85 --export_dir ~/new_model

I hope that helps. Even though the WER has gone down, the loss has increased, and if we look at the inference results of 0.4.1, they are worse than those of 0.3.0. Please let me know if there is any other info you might need.

It looks like you are using junk data.

The log output indicates that the source sentences, the ground truth, are filled with texts of the form “amdmwu”, “tamenglong”… which are not English.

Sorry, that was for another thread.

The datasets include Indian-accent data, so they contain names of some Indian places like “Tamenglong”, and “amdmwu” is actually “A-M-D-M-W-U”. What I don’t understand is: does it randomly pick data from the test datasets each time rather than going through all of the data? I have used the same data with 0.3.0 several times and it never picked these sentences for the test epochs, as you can see from the 0.3.0 screenshot as well.

The printout shows only the worst results (highest WER with lowest loss) to help detect systematic errors. And yes, it goes through all the data.

Also, you can’t expect a speech recognition engine for one language to understand words from another language. If I started speaking to you in a language you did not speak, you would not be able to correctly write out what I was saying.

Thanks @kdavis. That helps. This means it would improve recognition of general English (which makes up most of my datasets) with an Indian accent, but not of regional names and places, am I correct?
Could you please help me out with this
WARNING:root:frame length (1536) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid

I tried changing N_FFT to 1536 in deepspeech.cc but that didn’t help.

That warning should not happen unless you modified the audio feature extraction somehow. deepspeech.cc is not relevant; the training code is Python only, see util/audio.py.

@reuben I have been getting this with both 0.3.0 and 0.4.1. In both cases I didn’t touch the code, except for setting ‘ignore_longer_outputs_than_inputs=True’ for one of the datasets in DeepSpeech.py, but that shouldn’t be the cause because I was getting this even before that change.

However, I just observed that the audio data I have been using is 48 KHz. Could that be the reason?

Yes.

One easy way to improve on the current results, if you have lots of English texts mixed with Indian place names, is to create a new language model + trie using these texts. That should help with the Indian place name problem.
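For reference, building such a language model typically goes through KenLM and DeepSpeech’s generate_trie tool. The commands below are a hedged sketch, not the exact recipe: all file names are illustrative, and the generate_trie argument order has changed between releases, so check the data/lm instructions in your checkout.

```shell
# corpus.txt: one sentence per line, the English texts you have plus the
# Indian place names / spelled-out words that appear in the transcripts.
# (All paths here are illustrative.)

# 1. Train a 5-gram ARPA language model with KenLM
lmplz -o 5 < corpus.txt > lm.arpa

# 2. Convert it to the binary format the decoder loads
build_binary lm.arpa lm.binary

# 3. Build the trie from your alphabet and the binary LM
#    (run generate_trie with no arguments to see the expected usage
#    for your DeepSpeech release)
./generate_trie alphabet.txt lm.binary trie
```

The lm.binary and trie produced this way are then passed to training/inference in place of the release files.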

Yes, that’s probably the reason.
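For anyone hitting the same warning: the feature extraction expects 16 kHz, mono, 16-bit audio. In practice something like `sox input.wav -r 16000 -c 1 -b 16 output.wav` is the usual fix; purely to illustrate the 3:1 ratio between 48 kHz and 16 kHz, here is a naive stdlib-only sketch (it does no low-pass filtering, so real conversions should go through sox or ffmpeg):

```python
import wave

def downsample_48k_to_16k(src_path, dst_path):
    """Naive 3:1 decimation for 16-bit mono 48 kHz WAV files.

    Illustration only: proper resampling should low-pass filter
    first (sox/ffmpeg do this); here we simply keep every 3rd sample.
    """
    with wave.open(src_path, "rb") as src:
        assert src.getframerate() == 48000, "expected 48 kHz input"
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        raw = src.readframes(src.getnframes())
    # 16-bit mono: each frame is 2 bytes, so step 6 bytes = every 3rd frame
    out = b"".join(raw[i:i + 2] for i in range(0, len(raw), 6))
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(16000)
        dst.writeframes(out)
```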

You’re looking at 10 sentences out of your entire test set and deciding that the 0.4.1 results are worse. The WER is lower, so the model is making fewer mistakes. It would really help your results if you built a language model that also includes Indian place names.
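For context, WER is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words, so lower is better. A minimal sketch (not the scoring code DeepSpeech itself uses):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```

So a model scoring WER 0.14 is, on average, getting more reference words right than one scoring 0.17, even if a handful of its printed worst-case sentences look bad.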

Thanks @kdavis @reuben. I’ll try doing that and I changed the sampling rate which fixed the warning issue.

I trained with this data for 6 more epochs (--epoch -6) on top of 0.4.1, and early stopping kicked in at the 5th epoch. The following is the console output:

I Validation of Epoch 72 - loss: 14.327562
I Validation of Epoch 73 - loss: 13.320069
I Validation of Epoch 74 - loss: 12.927047
I Validation of Epoch 75 - loss: 12.514126
I Validation of Epoch 76 - loss: 12.529731
I Early stop triggered as (for last 4 steps) validation loss: 12.529731 with standard deviation: 0.329058 and mean: 12.920414
I FINISHED Optimization - training time: 3:42:16
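The plateau behaviour in that log can be illustrated with a simplified check over the last few validation losses. This is not DeepSpeech’s exact early-stopping criterion; the window size and threshold below are illustrative only:

```python
import statistics

def should_early_stop(dev_losses, window=4, max_std=0.5):
    """Simplified plateau check, loosely modeled on the log above.

    Stop when the latest validation loss went up relative to the
    previous step AND the last `window` losses are nearly flat
    (small standard deviation). Thresholds are illustrative.
    """
    if len(dev_losses) < window:
        return False
    recent = dev_losses[-window:]
    return recent[-1] > recent[-2] and statistics.stdev(recent) < max_std
```

With the losses from the log above (ending 12.514126 then 12.529731), the loss ticks up while the recent window is nearly flat, which is the shape of curve that triggers a stop.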

I compared the performance of the 0.4.1 release model with the model I trained:
With release model:

sranjeet@sranjeet-Precision-M4800:~/mycroft-core$ deepspeech --model models/output_graph_official041.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio /media/sranjeet/Samsung_T5/audio_test_files/ted_talk/ted1_mod.wav
TensorFlow: v1.12.0-rc2-5-g1c93ca2
DeepSpeech: v0.4.0-alpha.0-0-g8b0abd5
2019-01-18 09:18:57.556545: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
the continuing language because i can this is one of these magical abilities that we cut

Using my model:

(.venv) sranjeet@sranjeet-Precision-M4800:~/mycroft-core$ deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio /media/sranjeet/Samsung_T5/audio_test_files/ted_talk/ted1_mod.wav
Loading model from file models/output_graph.pbmm
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
2019-01-18 09:45:32.442891: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0496s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 0.164s.
Running inference.
i sing my riches can he is one of these matiahae

As you can see, the inference result is totally off. I wonder why this happened. I was not expecting any improvement, because the test audio was American-accented English, but I was not expecting the performance to degrade either.

Please, as we suggested, create a new language model + trie with Indian place names.

Do not, for now, further train the 0.4.1 model.