Seems like a clever solution – nice!
I am not very knowledgeable about speech recognition, but take for instance models trained with the CTC loss, which is commonly used for this task: they are often combined with an external language model at decoding time. Since those language models are typically trained on separate (larger) text corpora, I imagine it won’t matter too much if the speech corpus contains gibberish text.
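To make the idea concrete, here is a toy sketch of the usual "shallow fusion" scoring rule, where a candidate transcript's acoustic score is combined with a score from an externally trained language model. All probabilities and the `lm_weight` value below are made-up numbers for illustration, not from any real system:

```python
import math

def fused_score(acoustic_logprob, lm_logprob, lm_weight=0.3):
    # Shallow-fusion scoring: log P_acoustic(y | x) + lambda * log P_LM(y).
    # lm_weight (lambda) is a tunable hyperparameter, 0.3 is arbitrary here.
    return acoustic_logprob + lm_weight * lm_logprob

# Two candidate transcripts the acoustic model finds roughly equally likely;
# the external LM strongly prefers the sensible sentence. Numbers are invented.
candidates = {
    "recognize speech":   {"acoustic": math.log(0.48), "lm": math.log(0.010)},
    "wreck a nice beach": {"acoustic": math.log(0.52), "lm": math.log(0.0001)},
}

best = max(
    candidates,
    key=lambda y: fused_score(candidates[y]["acoustic"], candidates[y]["lm"]),
)
print(best)  # the LM tips the decision toward the plausible sentence
```

The point is that the text knowledge comes from the LM's own training corpus, so gibberish in the speech transcripts would mostly affect the acoustic term, not the LM term.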
However, there are end-to-end systems that don’t use an external language model. See for instance this 2017 paper by Battenberg et al. From the second paragraph of section 3:
"attention and RNN-Transducers implicitly learn a language model from the speech training corpus"
For this type of system, I can imagine that nonsensical text in the training corpus would degrade the implicitly learned language model and lead to lower performance on unseen data, because the model would not have been able to learn what makes a sentence plausible.