Validating meaningless sentences in the Sentence Collector?

I came to the forum looking to see if anybody else had already reported this.

To give a concrete example: "The architect claims to act on the psychoanalysts by touching seventy ratchets." (“L’architecte déclare agir sur les psychanalistes en touchant soixante-dix crécelles.”)

I strongly believe such sentences should be excluded for the following reasons:

  1. They confuse speakers, leading to lower-quality recordings, as @Michael_Maggs pointed out above.
  2. They lower the quality of the corpus from a machine learning point of view, because:
    • The distribution of sentences contained in the corpus will no longer be representative of real spoken sentences, both in terms of grammatical structure (which is very repetitive for auto-generated sentences) and in terms of real-world word usage.
    • Many algorithms may rely internally on a language model, which will be thrown off when it encounters nonsensical sentences.
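
To make the language-model point a bit more tangible, here is a toy sketch. It is not anything the Sentence Collector or any real recognizer uses: the three reference sentences, the function names, and the add-one smoothing are all my own assumptions for illustration. It trains a tiny bigram model on ordinary sentences and compares how "surprised" the model is by a plausible sentence versus a nonsensical one.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left-hand contexts over whitespace tokens."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def avg_surprisal(sentence, contexts, bigrams, vocab):
    """Average bits per token under an add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split()
    vocab_size = len(vocab) + 1                # +1 slot for unseen words
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        total -= math.log2(p)
    return total / (len(tokens) - 1)

# Hypothetical stand-in for "real spoken sentences".
reference = [
    "the architect is drawing the plans for the house",
    "the architect is talking to the builders",
    "the builders are working on the house",
]
model = train_bigram(reference)

plausible_score = avg_surprisal("the architect is working on the plans", *model)
nonsense_score = avg_surprisal(
    "the architect is touching the psychoanalysts on the ratchets", *model
)
print(f"plausible: {plausible_score:.2f} bits/token")
print(f"nonsense:  {nonsense_score:.2f} bits/token")
```

With this toy data the nonsensical sentence scores higher (more surprising) than the plausible one. A real language model is far more sophisticated, but the mismatch between its expectations and auto-generated nonsense is the same kind of problem.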