I am trying to train DeepSpeech for non-English audio. I collected a small dataset of audio files based on this tutorial: [TUTORIAL : How I trained a specific french model to control my robot]. The dataset contains roughly 5,000 seconds of audio files along with their transcripts.
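For reference, this is the layout I used for the CSV files (the column names follow the DeepSpeech importer convention; the paths and transcripts below are just placeholders, not my real data):

wav_filename,wav_filesize,transcript
/data/clips/sample_0001.wav,163244,bonjour tout le monde
/data/clips/sample_0002.wav,98120,ouvre la porte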
I trained a model using 70% of the dataset:
python DeepSpeech.py --train_files "./data/train.csv" --dev_files "./data/dev.csv" --test_files "./data/test.csv" --alphabet_config_path "./data/alphabet.txt" --lm_binary_path "./data/lm.binary" --lm_trie_path "./data/trie" --export_dir "./mymodels" --checkpoint_dir "./checkpoint" --validation_step 2 --max_to_keep 10
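In case it matters, this is roughly how I produced the train/dev/test CSVs (a minimal sketch; the clips.csv name and the even split of the remaining 30% between dev and test are just what I did, not from the tutorial):

# Minimal sketch of how I split the data, assuming a single clips.csv
# with the DeepSpeech columns (wav_filename, wav_filesize, transcript).
import pandas as pd

df = pd.read_csv("clips.csv")
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle the rows

n = len(df)
n_train = int(0.7 * n)   # 70% for training
n_dev = int(0.15 * n)    # remaining 30% split evenly between dev and test

df[:n_train].to_csv("./data/train.csv", index=False)
df[n_train:n_train + n_dev].to_csv("./data/dev.csv", index=False)
df[n_train + n_dev:].to_csv("./data/test.csv", index=False)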
(I can share a Jupyter notebook if you want to see the output.)
During training, the training loss stays between 130 and 150. When I run some tests on the final model, it predicts just an empty string or a single character, even for training samples. It seems that my model has learned nothing. Can you give me some advice? How much data do I need to feed the model?
I also wonder whether my training set is big enough. Could you please let me know which parameters I should tune to overcome this problem?
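For example, so far I have left hyperparameters like these at their defaults; I assume these are the kinds of flags you mean (the names are taken from the flags of the DeepSpeech version I am running, so they may differ slightly between versions, and the values below are only guesses on my part):

python DeepSpeech.py --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.15 --epoch 30 --train_batch_size 8 --dev_batch_size 8 --test_batch_size 8 ... (plus the same data and checkpoint flags as above)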