TV speech recognition

javad · March 22, 2018, 9:56am

Hi Guys!

I am currently working on a Japanese speech recognition system for TV that handles simple commands such as "turn on" and "turn off". The total vocabulary used in the commands I would like to train is about 600 words. Given this, how many hours of data do you think would be enough to create a model? My second question: do you think DeepSpeech is suitable for my project? (PS: I previously used the Julius speech recognition engine, but the results were not very good, so I think it would be better to use a deep neural network.)

lissyx · March 25, 2018, 3:03pm

You should have a look at the work done by @elpimous_robot with his robot, in "TUTORIAL : How I trained a specific french model to control my robot", because, to me, your use case looks close to his.

elpimous_robot · March 25, 2018, 3:28pm

Hi Javad.
DeepSpeech is the best solution (for now).

Please be more precise about the command sequences you need:
Ex: "turn the light on", "can you power on…"
Also tell us about sequence length, single words, noisy environment…

A good starting point is 8 to 10 WAV recordings per sequence (with SMALL variations between the recordings).
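
To turn the 8 to 10 recordings per sequence rule of thumb into a rough data budget, here is a minimal sketch. The function name `estimate_recordings`, the 200 command sequences, and the 3-second average clip length are all hypothetical values for illustration, not numbers from this thread; plug in your own.

```python
def estimate_recordings(num_sequences, takes_per_sequence=(8, 10), avg_clip_seconds=3.0):
    """Rough size of a command dataset.

    num_sequences      -- number of distinct command sequences (assumption)
    takes_per_sequence -- (low, high) recordings per sequence, 8 to 10 as suggested above
    avg_clip_seconds   -- average clip length in seconds (assumption)
    """
    low_takes, high_takes = takes_per_sequence
    low_clips = num_sequences * low_takes
    high_clips = num_sequences * high_takes
    return {
        # total number of WAV files to record (low, high)
        "clips": (low_clips, high_clips),
        # total audio duration in hours (low, high)
        "hours": (low_clips * avg_clip_seconds / 3600,
                  high_clips * avg_clip_seconds / 3600),
    }


if __name__ == "__main__":
    # Hypothetical: 200 distinct command sequences
    print(estimate_recordings(num_sequences=200))
```

This only gives a starting budget; it says nothing about how many speakers or how much background noise you need to cover.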