I found that the state-of-the-art DeepSpeech performs well on ASR tasks, so I want to build an ASR system for isolated words using DeepSpeech. The input to the system will be the audio of a single word at a time, and the output should be one of the words used during training. I followed the steps in these tutorials:
DeepSpeech Playbook | deepspeech-playbook (mozilla.github.io)
Welcome to DeepSpeech’s documentation! — DeepSpeech 0.9.3 documentation
and trained the model on a few words using the Google Colab research platform, but the same dataset gives better results with Kaldi and CMUSphinx:
- Kaldi: WER 4.5%
- CMUSphinx: WER 13.8%
- DeepSpeech: test WER 80-90% (training loss 14.78, validation loss 29.56)
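For reference, the WER figures above are the word-level edit distance between the reference and the hypothesis, divided by the reference length. A minimal sketch of that metric (my own helper, not part of any of these toolkits):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For isolated words, every wrong word counts as one substitution, so `wer("princess", "rinces")` is 1.0 even though the outputs differ only by a couple of characters.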
My questions about DeepSpeech are:
- Is there a different setup or model structure recommended for isolated words?
- In CMUSphinx and Kaldi, even an erroneous output is displayed as a complete valid word from the dictionary, but DeepSpeech outputs a character sequence even when it is not a meaningful word. For example, the valid word is "Princess" but the output is "rinces".
How can I train a DeepSpeech RNN model for isolated words?
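On the "rinces" point: DeepSpeech decodes characters with CTC, so without an external scorer (the KenLM language model package the DeepSpeech 0.9.3 docs describe building from your own vocabulary) nothing constrains the output to be a dictionary word, unlike the lexicon-based decoders in Kaldi and CMUSphinx. Besides building a scorer limited to the training word list, a simple post-processing workaround is to snap each hypothesis to the closest vocabulary word by edit distance. A minimal sketch, assuming a small fixed vocabulary (`snap_to_vocab` and `edit_distance` are hypothetical helpers of mine, not DeepSpeech API):

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance, single-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

def snap_to_vocab(hypothesis: str, vocab: list[str]) -> str:
    """Replace a raw character-sequence hypothesis with the nearest known word."""
    return min(vocab, key=lambda word: edit_distance(hypothesis.lower(), word.lower()))
```

With this, a raw output like "rinces" would be mapped back to "princess" as long as it is closer to that word than to anything else in the training vocabulary. It does not fix the underlying acoustic model, but it makes the isolated-word outputs comparable to what the dictionary-constrained decoders print.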