I came to the forum looking to see if anybody else had already reported this.
To give a concrete example: "The architect declares that he acts on the psychoanalysts by touching seventy ratchets." (“L’architecte déclare agir sur les psychanalistes en touchant soixante-dix crécelles.”)
I strongly believe such sentences should be excluded for the following reasons:
- They confuse speakers, leading to lower-quality recordings, as expressed above by @Michael_Maggs.
- They lower the quality of the corpus from a machine learning point of view, because:
  - The distribution of sentences contained in the corpus will no longer be representative of real spoken sentences, both in terms of grammatical structure (which is very repetitive for auto-generated sentences) and in terms of real-world word usage.
  - Many algorithms may rely internally on a language model, which will get confused when it encounters nonsensical sentences.