In the French corpus of the Sentence Collector, I have recently been seeing quite a few grammatically correct but completely meaningless sentences (like “I’m driving my pizza with an elephant on my cheese”). What are we supposed to do with those?
I came to the forum looking to see if anybody else had already reported this.
To give a concrete example: “The architect claims to act on the psychoanalysts by touching seventy ratchets.” (“L’architecte déclare agir sur les psychanalistes en touchant soixante-dix crécelles.”)
I strongly believe such sentences should be excluded for the following reasons:
They confuse speakers, leading to lower quality recordings, as expressed above by @Michael_Maggs.
They lower the quality of the corpus from a machine learning point of view, because:
The distribution of sentences contained in the corpus will no longer be representative of real spoken sentences, both in terms of grammatical structure (which is very repetitive for auto-generated sentences) and in terms of real-world word usage.
Many algorithms may be relying internally on a language model, which will get confused when it encounters nonsensical sentences.
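To make the point about repetitive structure concrete, here is a rough sketch of how one might spot it. It reduces each sentence to a "skeleton" by keeping only function words and replacing everything else with `_`, then counts how often each skeleton occurs. The word list, the function names and the threshold are all my own assumptions for illustration, in English for readability; this is a crude heuristic, not something the Sentence Collector does.

```python
from collections import Counter

# Assumption: a tiny hand-picked list of function words, for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "my", "with", "on", "by", "to", "in", "of"}

def skeleton(sentence):
    """Keep function words, replace every other token with '_'."""
    tokens = sentence.lower().strip(".!?").split()
    return " ".join(t if t in FUNCTION_WORDS else "_" for t in tokens)

def repeated_skeletons(sentences, min_count=3):
    """Return the skeletons shared by at least min_count sentences."""
    counts = Counter(skeleton(s) for s in sentences)
    return {sk: n for sk, n in counts.items() if n >= min_count}

sentences = [
    "The architect touches my pizza.",
    "The elephant drives my cheese.",
    "The pizza touches my architect.",
    "I went home early today.",
]
print(repeated_skeletons(sentences))
```

A batch of generated sentences would tend to pile up on a handful of skeletons, whereas naturally written sentences should spread out much more. A real check would need a proper tokenizer or part-of-speech tagger for French.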
Where do these come from? It sounds like someone wrote a script to insert random words into boilerplate sentences, just to increase the quantity.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Well, if we are starting to talk about automatically generated sentences, that’s another question. In the ~5000 validations I did on the Sentence Collector, I did not see any sentences of this kind. Maybe someone is messing with us?
I assume someone with DB access will be able to see who added all these French sentences. My guess is that they used either a set of templates or a generator based on a PCFG (probabilistic context-free grammar), since most of the sentences are grammatically correct.
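To illustrate why such a generator would give exactly this kind of output, here is a minimal sketch of a PCFG-based generator. The grammar, the vocabulary and the function names are made up for the example (and in English for readability); I have no idea what tool was actually used.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to (probability, expansion) pairs.
GRAMMAR = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(1.0, ["Det", "N"])],
    "VP": [(0.5, ["V", "NP"]), (0.5, ["V", "NP", "PP"])],
    "PP": [(1.0, ["P", "NP"])],
    "Det": [(0.5, ["the"]), (0.5, ["my"])],
    "N": [(0.25, ["architect"]), (0.25, ["pizza"]),
          (0.25, ["elephant"]), (0.25, ["cheese"])],
    "V": [(0.5, ["drives"]), (0.5, ["touches"])],
    "P": [(0.5, ["with"]), (0.5, ["on"])],
}

def generate(symbol, rng):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:  # terminal: an actual word
        return [symbol]
    weights = [w for w, _ in GRAMMAR[symbol]]
    expansions = [rhs for _, rhs in GRAMMAR[symbol]]
    rhs = rng.choices(expansions, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(child, rng))
    return words

def generate_sentence(seed=None):
    """Generate one sentence; the same seed always gives the same sentence."""
    rng = random.Random(seed)
    return " ".join(generate("S", rng)).capitalize() + "."

for seed in range(3):
    print(generate_sentence(seed))
```

The grammar guarantees correct syntax, but the words are drawn with no regard for meaning, so you get things like an elephant driving a cheese. It also shows why the structure is so repetitive: this toy grammar can only ever produce two sentence shapes.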
Thanks for that, Michael. After a quick investigation, it seems they come from user bf5man, who entered in the source field: “These are my own based on an open source tool.”
Regards,
Cedric
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
It’d be great if we could reach out to him so we can adjust the contributions.
I’m now rejecting spoken versions of the meaningless sentences. It is a pity, as we are wasting the time of these volunteers. How could we remove bf5man’s sentences?