Hello. I have been running DeepSpeech in a few configurations on the Google Commands Dataset. There are 65,000 one-second WAV files in total, sampled at 16 kHz, which comes to about 18 hours of audio. So far I have run training for short periods (2 epochs) on 0.7.4, and at inference my model is able to predict some letters (e’s and o’s) without a scorer (with the scorer it predicts blanks).
I have tried running the same data on 0.7.3 in the past with 10 epochs of training and received blank inference with and without a scorer.
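For what it's worth, the ~18 hour figure above is just 65,000 clips times one second each. Below is a rough sketch of how the real total could be checked, since not every clip is guaranteed to be exactly one second long. The function name `total_audio_hours` and the directory layout are my own assumptions, not anything from DeepSpeech; it only uses Python's standard `wave` module.

```python
import wave
from pathlib import Path


def total_audio_hours(wav_dir):
    """Sum the duration of every .wav file under wav_dir, in hours."""
    total_seconds = 0.0
    for path in Path(wav_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 3600.0


# Back-of-the-envelope estimate from the post: 65,000 clips of 1 s each.
estimated_hours = 65000 * 1.0 / 3600.0  # roughly 18 hours
```

Calling `total_audio_hours` on the dataset root (wherever the WAV files were extracted) would show whether the actual total matches the estimate.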
I am curious about a few things:
- Is this probably not enough hours of audio for good results?
- Are one-second, one-word utterances a poor fit for the RNN architecture? Even using the out-of-the-box DeepSpeech model and scorer, the WER I got on a random sample of 1500 Google Commands files was ~47%. Interestingly, results seemed to be better without the scorer (I imagine this has to do with the probability dependencies learned from the DeepSpeech corpus). KenLM doesn’t support unigram order, so it seems that using a scorer on one-word utterances is not ideal.
- I know these epoch counts are low, but hardware is limited at the moment. I mainly just want to prove that I can train DeepSpeech from scratch and get tangible results (good or bad, but not blank inference), and make sure that my configuration or setup is not the issue.
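To be clear about what I mean by WER on one-word utterances, here is a minimal sketch: word-level edit distance summed over all files, divided by the total number of reference words. This is not DeepSpeech's own evaluation code, and the function names are mine. With one-word references, a blank prediction counts as one deletion, so an all-blank model scores 100% WER.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (sub/ins/del cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) transcript pairs."""
    errors = 0
    ref_len = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += word_edit_distance(ref_words, hyp_words)
        ref_len += len(ref_words)
    return errors / ref_len if ref_len else 0.0


# Made-up example: one correct, one substitution, one blank, one correct.
sample = [("yes", "yes"), ("no", "know"), ("stop", ""), ("go", "go")]
sample_wer = corpus_wer(sample)
```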
Aside: In both of my training trials (2 epochs and 10 epochs), the training loss gets down to about 7.25 while the test/dev loss stays around 30. I have not yet started experimenting with dropout and other hyperparameters because of the limited hardware.
I have faith in the DeepSpeech architecture and want to make sure I am using and understanding this software inside and out before I fully assess its performance.
Thank you