To create more training data for DeepSpeech, we are trying to split the audio we have into individual words and then rearrange them (with sufficient buffers in between).
For example, if we have an audio clip with the text:
“hey how are you feeling today”
and if we cut this audio into small chunks with the following transcriptions:
“hey”
“how”
“are”
“you”
“feeling”
“today”
can we rearrange them like:
“hey today you feeling are how”
Does this kind of rearrangement affect the language model?
And what if we just remove some words from the original sequence, like:
“hey are feeling today”
Does it still matter, considering that we have a lot of data and a similar, fully correct sentence is also captured somewhere in it?
The language model works by “down voting” sequences of words that are less likely and “up voting” sequences of words that are more likely, boosting the recognition of likely phrases.
For example, if some audio could have been transcribed as
How to wreck a nice beach.
or as…
How to recognize speech.
the language model would select the second as it’s a more likely sentence in English.
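
To make the "up voting" and "down voting" concrete, here is a minimal sketch of a bigram language model with add-one smoothing. It is only an illustration and not DeepSpeech's actual scorer: the tiny CORPUS and the names train_bigram, log_prob and pick_transcript are all made up for this example.

```python
import math
from collections import Counter

# Toy training text (an assumption for illustration; a real language model
# is built from a far larger corpus).
CORPUS = [
    "how to recognize speech",
    "how to recognize speech with a language model",
    "it is hard to recognize speech in noise",
    "we walked along a nice beach",
    "hey how are you feeling today",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, using <s> and </s> as sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a whole sentence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Word pairs never seen in training still get a small, nonzero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_transcript(candidates, unigrams, bigrams):
    """Return the candidate transcript the language model finds most likely."""
    return max(candidates, key=lambda s: log_prob(s, unigrams, bigrams))


unigrams, bigrams = train_bigram(CORPUS)

candidates = ["how to wreck a nice beach", "how to recognize speech"]
print(pick_transcript(candidates, unigrams, bigrams))

# The shuffled sentence from the question scores lower than the original,
# because most of its word pairs never occur in the training text.
print(log_prob("hey how are you feeling today", unigrams, bigrams))
print(log_prob("hey today you feeling are how", unigrams, bigrams))
```

With this toy corpus the model picks "how to recognize speech", and the original word order scores higher than the shuffled one. Note that a plain sentence probability like this also favours shorter candidates, since every extra word multiplies in another factor below one.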