Problem: converging to a wrong model


(Mac 71128) #1

I am trying to train DeepSpeech on non-English audio. I collected a small dataset of audio files based on this tutorial [TUTORIAL : How I trained a specific french model to control my robot]. I have 5000-seconds audio files along with their transcripts in my dataset.

I trained a model using 70% of the dataset:
python DeepSpeech.py --train_files "./data/train.csv" --dev_files "./data/dev.csv" --test_files "./data/test.csv" --alphabet_config_path "./data/alphabet.txt" --lm_binary_path "./data/lm.binary" --lm_trie_path "./data/trie" --export_dir "./mymodels" --checkpoint_dir "./checkpoint" --validation_step 2 --max_to_keep 10
(I can share a jupyter notebook if you want to see the output.)

During training, the training loss stays between 130 and 150. When I run some tests on the final model, it predicts only an empty string or a single character, even for a training sample. It seems that my model learned nothing. Can you give me some advice? How much data should I feed the model?
I wonder whether my training set is big enough. Could you please let me know which parameters I should tune to overcome the problem?


(Lissyx) #2

Well, obviously you are relying on default parameters. You need to explore turning knobs like the learning rate or the width of the network, etc. It all depends on your dataset as well.
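
As a rough illustration of "turning knobs", here is a minimal sketch that builds one training command per combination of learning rate and network width. The flag names `--learning_rate` and `--n_hidden` are assumptions about the DeepSpeech version in use, so check them against the flags defined in your checkout. The values swept and the per-run checkpoint directory names are hypothetical.

```python
from itertools import product

# Base command taken from the original post (data paths unchanged).
BASE = [
    "python", "DeepSpeech.py",
    "--train_files", "./data/train.csv",
    "--dev_files", "./data/dev.csv",
    "--test_files", "./data/test.csv",
    "--alphabet_config_path", "./data/alphabet.txt",
]


def sweep_commands(learning_rates, hidden_sizes):
    """Return one argument list per (learning rate, width) combination.

    --learning_rate and --n_hidden are assumed flag names; verify them
    against your DeepSpeech version before running anything.
    """
    commands = []
    for lr, width in product(learning_rates, hidden_sizes):
        run = f"lr{lr}_h{width}"  # hypothetical run name
        commands.append(BASE + [
            "--learning_rate", str(lr),
            "--n_hidden", str(width),
            # Separate directory per run, so checkpoints from a network
            # of a different width are never reloaded by mistake.
            "--checkpoint_dir", f"./checkpoint/{run}",
        ])
    return commands


if __name__ == "__main__":
    for cmd in sweep_commands([1e-3, 1e-4], [512, 1024]):
        print(" ".join(cmd))
```

Each printed line can be run by hand or passed to `subprocess.run`. Comparing the dev loss across runs shows which direction is worth exploring further.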


(Reuben Morais) #3

Also, I don’t think inputs that are 5000 seconds long are ever going to work. That’s 250000 RNN steps! You should try using voice activity detection to split that into smaller chunks. Much smaller, around 10s or less per training sample. That will also help with using larger batch sizes and speed up your training.
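
The step count above can be reproduced with a small sketch. It assumes one RNN step per 20 ms of audio, which is the step size implied by 250000 steps for 5000 seconds. The function name `rnn_steps` is hypothetical.

```python
def rnn_steps(duration_s, step_ms=20):
    """Number of RNN time steps for a clip, assuming one step per step_ms."""
    return int(duration_s * 1000 // step_ms)


if __name__ == "__main__":
    print(rnn_steps(5000))  # a single 5000-second clip
    print(rnn_steps(10))    # a 10-second chunk
```

Under that assumption a 10-second sample needs 500 steps rather than 250000, which is why short chunks are much easier to train on and to batch.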


(Mac 71128) #4

Sorry for being unclear. In total, I have 5000 seconds of data, which is the sum of 759 audio files, each at most 10 seconds long.
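
For anyone who wants to check the same figures on their own dataset, here is a minimal sketch using only Python's standard `wave` module. It assumes the clips are uncompressed WAV files. The helper names `wav_duration` and `summarize` are hypothetical.

```python
import wave


def wav_duration(path):
    """Duration of one WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize(paths):
    """Total, maximum and mean clip duration over a list of WAV paths."""
    durations = [wav_duration(p) for p in paths]
    if not durations:
        raise ValueError("no audio files given")
    return {
        "files": len(durations),
        "total_s": sum(durations),
        "max_s": max(durations),
        "mean_s": sum(durations) / len(durations),
    }
```

With the numbers given here (759 files, 5000 seconds in total), the mean clip length works out to roughly 6.6 seconds.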