If we join very small chunks of audio to create bigger chunks of acceptable size, do we need to keep in mind the integrity of the language being spoken for the language model to work properly?

In order to create a larger dataset for DeepSpeech, we are trying to split the audio data we have into individual words and then rearrange them (with sufficient silence buffers in between).

For example, if we have an audio with the text:

“hey how are you feeling today”

and if we cut this audio into small chunks with the following transcriptions:

“hey”
“how”
“are”
“you”
“feeling”
“today”

can we rearrange them like this:

“hey today you feeling are how”

does this kind of rearrangement affect the language model?

And what if we just remove some words from the original sequence, like:

“hey are feeling today”

Also, does it help that we have a lot of data, and that a similar, fully correct sentence is also captured correctly somewhere in it?
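For reference, this is roughly how we are stitching the word-level chunks back together. It is just a minimal sketch using pydub; the file names and buffer length here are placeholders:

```python
from pydub import AudioSegment

# Word-level chunks cut from the original recording
# (placeholder file names; one WAV file per word).
words = ["hey", "today", "you", "feeling", "are", "how"]
chunks = [AudioSegment.from_wav(f"{w}.wav") for w in words]

# Silence buffer inserted between words (duration is a guess).
buffer = AudioSegment.silent(duration=200)  # milliseconds

# Join the chunks in the rearranged order, with buffers in between.
combined = chunks[0]
for chunk in chunks[1:]:
    combined += buffer + chunk

combined.export("hey_today_you_feeling_are_how.wav", format="wav")
```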

Yes.

The language model works by “down-voting” sequences of words that are less likely and “up-voting” sequences of words that are more likely, boosting the recognition of likely phrases.

For example, if some audio could have been transcribed as

How to wreck a nice beach.

or as…

How to recognize speech.

the language model would select the second, as it is a more likely sentence in English.
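You can see this concretely with KenLM, which DeepSpeech uses for its language model. A minimal sketch, assuming you have a trained KenLM model file (the path here is a placeholder):

```python
import kenlm

# Placeholder path to a trained KenLM language model (.arpa or .binary).
model = kenlm.Model("lm.binary")

candidates = [
    "how to wreck a nice beach",
    "how to recognize speech",
]

# score() returns the total log10 probability of the sentence,
# including begin/end-of-sentence markers; a higher (less negative)
# score means a more likely word sequence.
for sentence in candidates:
    print(f"{model.score(sentence, bos=True, eos=True):8.2f}  {sentence}")
```

With a typical English language model, “how to recognize speech” gets the higher score, so the decoder prefers it. The same scoring would heavily penalize a scrambled sentence like “hey today you feeling are how”, which is why rearranged training clips work against the language model.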

What about the second option of just removing a few words from the original sentence while keeping their order, as below:

“hey are feeling today”

considering that “hey how are you feeling today” will also be part of the dataset and is likely to occur there many times.

Will this approach be as bad as the previous one, or is it promising enough to be worth a try?

This will have the same problem: “hey are feeling today” is still an unlikely word sequence in English, so the language model will down-vote it in the same way.