Generate voice command dataset

subhash · September 2, 2019, 9:55am

I want to create a voice command dataset for English keywords and train my model on it. I don’t want to crowdsource it. Is there any way to generate it?

This is what I plan to do to generate it:

From the web I found that we can use existing audio-transcript data and locate the words in the respective audio files. On exploring further, I found a technique called forced alignment (FA). Using FA, one can find the timestamps of individual words in an audio file, and then I can extract them using sox or something else.
Two weeks ago Mozilla released a forced-alignment tool that uses DeepSpeech. I am not sure whether it can do word-level forced alignment.
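
As a rough sketch of the extraction step, assuming an aligner has already produced word-level timestamps: the snippet below cuts one WAV clip per keyword occurrence using only Python's standard `wave` module instead of sox. The alignment format (a list of dicts with `word`, `start` and `end` in seconds) and the function name `extract_word_clips` are assumptions made for illustration, not the output format of any particular alignment tool.

```python
import os
import wave


def extract_word_clips(wav_path, alignments, out_dir, keywords, pad_s=0.05):
    """Cut one WAV clip per aligned keyword occurrence; return the clip paths.

    `alignments` is a hypothetical format: a list of dicts such as
    {"word": "stop", "start": 1.20, "end": 1.55}, with times in seconds.
    """
    os.makedirs(out_dir, exist_ok=True)
    wanted = {k.lower() for k in keywords}
    written = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for i, item in enumerate(alignments):
            word = item["word"].lower()
            if word not in wanted:
                continue
            # Convert seconds to frame indices, with a little padding
            # so the word is not clipped at its edges.
            start = max(0, int((item["start"] - pad_s) * rate))
            end = min(total, int((item["end"] + pad_s) * rate))
            if end <= start:
                continue
            src.setpos(start)
            frames = src.readframes(end - start)
            out_path = os.path.join(out_dir, f"{word}_{i:05d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # frame count is corrected on close
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

Word boundaries from an aligner are rarely exact, so the padding value is something to tune by listening to a sample of the clips.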

Before doing all this, I want to ask if anyone knows how to generate a voice command dataset programmatically.

lissyx · September 2, 2019, 11:22am

Likely that @Tilman_Kamp can help

The best question to start with is: what do you want to achieve? Voice commands? If so, you can try re-using the English model as-is and setting up a command-specific language model. In our tests that gave very good results.
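
To illustrate the first step of that approach: a command-specific language model is built from a text corpus containing only the phrases the application should recognize. The sketch below writes such a corpus, one normalized command per line. The helper name `write_command_corpus` and the example commands are hypothetical, and the later steps (building an n-gram model from the file, for instance with KenLM, and packaging it for DeepSpeech) depend on the release in use and are not shown.

```python
def write_command_corpus(commands, corpus_path):
    """Write one normalized command per line; return the sorted vocabulary."""
    vocab = set()
    lines = []
    for command in commands:
        # Keep only letters, spaces and apostrophes, in lower case.
        cleaned = "".join(
            ch for ch in command.lower() if ch.isalpha() or ch in " '"
        )
        words = cleaned.split()
        if not words:
            continue  # skip entries that are empty after cleaning
        lines.append(" ".join(words))
        vocab.update(words)
    with open(corpus_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return sorted(vocab)


if __name__ == "__main__":
    # Hypothetical command set for one application.
    words = write_command_corpus(
        ["Turn on the light", "Turn off the light", "Stop"],
        "commands.txt",
    )
    print(words)
```

Because only the corpus changes from one application to the next, a new command set would need a new language model but not a retrained acoustic model.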

subhash · September 2, 2019, 12:06pm

Let me elaborate on what I am after. I want to build a voice command use case in an Android app using DeepSpeech. The default model for Android works fine for the use case but is slow: it takes 4 seconds to process a 2-second voice command, which is too slow for me. I want to reduce the latency. A similar discussion happened in this thread, where you suggested reducing the complexity of the model (by lowering n_hidden from 2048; I plan to use 256) and retraining it. I believe the latency should drop with the new model. Now I need data to train it. I think I cannot use the large dataset that the main DeepSpeech model was trained on (correct me if I am wrong). Hence I thought of generating the voice command dataset from a voice corpus.
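
One way to put a number on the latency is the real-time factor (RTF): processing time divided by audio duration. The figures above give an RTF of 2.0, and anything above 1.0 means the model runs slower than real time. The sketch below computes it and includes a small timing helper for comparing models; `transcribe` is a hypothetical stand-in for whatever inference call is being measured.

```python
import time


def real_time_factor(processing_s, audio_s):
    """Return processing time divided by audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def measure_rtf(transcribe, audio, audio_s):
    """Time one call to `transcribe(audio)` and return its RTF.

    `transcribe` is a placeholder for the inference function under test.
    """
    started = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - started
    return real_time_factor(elapsed, audio_s)


if __name__ == "__main__":
    # Figures from the post: 4 seconds of processing for 2 seconds of audio.
    print(f"RTF = {real_time_factor(4.0, 2.0):.1f}")  # RTF = 2.0
```

Measuring the same clips before and after retraining with a smaller n_hidden would show how much of the latency the model size accounts for.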

As for the voice command use case, I will be using this in various applications, so I cannot have a fixed command set. Each application might have a new command set. Hence I am looking for a way to generate my voice command training dataset.