Validating meaningless sentences in the Sentence Collector?

Hi all,

In the French corpus of the Sentence Collector, I have recently been seeing quite a lot of grammatically correct but completely meaningless sentences (like “I’m driving my pizza with an elephant on my cheese”). What are we supposed to do with those?

Regards,

Cedric

It’s fine. We don’t require sentences to make sense, as long as they are readable and in French.

I came to the forum looking to see if anybody else had already reported this.

To give a concrete example: “The architect claims to act on the psychoanalysts by touching seventy rattles.” (“L’architecte déclare agir sur les psychanalistes en touchant soixante-dix crécelles.”)

I strongly believe such sentences should be excluded for the following reasons:

  1. They confuse speakers, leading to lower quality recordings, as expressed above by @Michael_Maggs.
  2. They lower the quality of the corpus from a machine learning point of view, because:
    • The distribution of sentences contained in the corpus will no longer be representative of real spoken sentences, both in terms of grammatical structure (which is very repetitive for auto-generated sentences) and in terms of real-world word usage.
    • Many algorithms may be relying internally on a language model, which will get confused when it encounters nonsensical sentences.

Where do these come from? It sounds like someone wrote a script to insert random words into boilerplate sentences, just to increase the quantity.

Well, if we are starting to talk about automatically generated sentences, that’s another question. In the ~5000 validations I did on the Sentence Collector, I did not see any sentences of this kind. Maybe someone is messing with us?

I assume someone with DB access will be able to see who added all these French sentences. My guess is that they used either a set of templates or a PCFG-based generator, since most of the sentences are grammatically correct.
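To illustrate the PCFG guess, here is a minimal sketch of how such a generator produces output that is grammatical but meaningless. The grammar, vocabulary, weights, and function names are all made up for illustration; nothing here is known about the tool that was actually used.

```python
import random

# Hypothetical toy grammar: each non-terminal maps to (weight, expansion) pairs.
# Symbols that are not keys of GRAMMAR are treated as terminal words.
GRAMMAR = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(0.5, ["the", "N"]), (0.5, ["my", "N"])],
    "VP": [(0.5, ["V", "NP"]), (0.5, ["V", "NP", "PP"])],
    "PP": [(1.0, ["with", "NP"])],
    "N": [(0.25, ["architect"]), (0.25, ["pizza"]),
          (0.25, ["elephant"]), (0.25, ["cheese"])],
    "V": [(0.5, ["drives"]), (0.5, ["touches"])],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal word
    weights = [w for w, _ in GRAMMAR[symbol]]
    expansions = [e for _, e in GRAMMAR[symbol]]
    # Pick one production at random, according to its weight.
    chosen = rng.choices(expansions, weights=weights, k=1)[0]
    words = []
    for child in chosen:
        words.extend(expand(child, rng))
    return words


def generate_sentence(rng):
    """Generate one sentence starting from the start symbol S."""
    return " ".join(expand("S", rng)).capitalize() + "."


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(generate_sentence(rng))
```

Every sentence follows the same few syntactic patterns, and nouns and verbs are chosen with no regard for meaning, which matches both symptoms described above.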

As Gregor already said on Slack, anyone is allowed to read the data source:

https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_fr/records

Once you have identified the username, you can filter by appending ?username=foo to the URL.
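For anyone who wants to do this from a script, here is a rough sketch. The URL and the `username` query parameter come from the posts above. The helper names and the sample records are invented, and the code assumes each record is a dict with a `username` key, which I have not checked against the real data.

```python
from collections import Counter
from urllib.parse import urlencode

RECORDS_URL = (
    "https://kinto.mozvoice.org/v1/buckets/App/collections/"
    "Sentences_Meta_fr/records"
)


def filtered_url(username):
    """Build the records URL restricted to a single username."""
    return RECORDS_URL + "?" + urlencode({"username": username})


def top_contributors(records, n=5):
    """Count records per username and return the n most frequent."""
    counts = Counter(r.get("username", "<unknown>") for r in records)
    return counts.most_common(n)


if __name__ == "__main__":
    # Invented sample data, standing in for records fetched from the URL.
    sample = [
        {"username": "alice"},
        {"username": "bob"},
        {"username": "alice"},
    ]
    print(filtered_url("foo"))
    print(top_contributors(sample))
```

Counting records per username is a quick way to spot a single account that submitted an unusually large batch.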

Thanks for that, Michael. After a quick investigation, it seems they come from user bf5man, who wrote in the source field: “These are my own based on an open source tool.”

Regards,

Cedric

It’d be great if we could reach out to him so that he can adjust his contributions.

I’ve added this to the list at “Discussion of new guidelines for uploaded sentence validation” as an example of the type of sentence to be rejected.

Hi all,

I’m now rejecting recordings of the meaningless sentences. It is a pity, as we are wasting the time of the volunteers who recorded them. How could we remove bf5man’s sentences?

Cedric

I can delete those, but I won’t get to it before Wednesday evening at the earliest.

That’s great, thanks!

This is done now. I’ve also added a script so I can rerun the deletion if this happens again.