I want to create a voice command dataset for English keywords and train my model on it. I don't want to crowdsource it. Is there any way to generate it?
This is what I plan to do to generate it:
From the web I found that we can use existing audio-transcript data and locate each word in its audio file. On further exploring, I found a technique called forced alignment (FA). Using FA, one can find the start and end timestamps of individual words in an audio file, and then I can cut those segments out using sox or a similar tool.
Two weeks ago Mozilla released a forced alignment tool that uses DeepSpeech. I am not sure whether it can do word-level forced alignment.
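To make the extraction step concrete, here is a minimal sketch of what I have in mind once word timestamps are available. It only uses Python's standard `wave` module. The alignment format (a list of dicts with `word`, `start`, and `end` in seconds) is my own assumption and not the output of any particular aligner, and the function name `extract_keyword_clips` and the `pad_s` parameter are just illustrative.

```python
import os
import wave


def extract_keyword_clips(wav_path, alignments, keywords, out_dir, pad_s=0.05):
    """Cut one WAV clip per aligned occurrence of a keyword.

    `alignments` is an ASSUMED format: a list of dicts with "word",
    "start" and "end" (in seconds), as a forced aligner might produce.
    `pad_s` adds a little context on each side of the word.
    """
    os.makedirs(out_dir, exist_ok=True)
    wanted = {k.lower() for k in keywords}
    written = []

    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()

        for i, item in enumerate(alignments):
            word = item["word"].lower()
            if word not in wanted:
                continue

            # Convert seconds to frame indices, clamped to the file bounds.
            first = max(0, int((item["start"] - pad_s) * rate))
            last = min(total, int((item["end"] + pad_s) * rate))
            if last <= first:
                continue

            src.setpos(first)
            frames = src.readframes(last - first)

            out_path = os.path.join(out_dir, f"{word}_{i:05d}.wav")
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)

    return written
```

For example, `extract_keyword_clips("talk.wav", alignments, ["yes", "no"], "clips")` would write one file per matching word into `clips/`. The same cut could be done with sox on the command line. I chose plain Python here only so the timestamps can be fed in directly from a script.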
Before doing all this, I want to ask whether anyone knows how to generate a voice command dataset programmatically.