Discussion of new guidelines for uploaded sentence validation

I see the word sentences repeated multiple times, but a lot of (good) content I see on Common Voice does not qualify as full sentences. For instance, the first green example in the OP, “The giant dinosaurs of the Triassic”, is a noun phrase rather than a sentence, as it lacks a verb. I assume then, given the examples, that both noun phrases and sentences are OK to add. How about other constructs, like adjective phrases and verb phrases?

This isn’t just to be pedantic about linguistics – I think it’s important to be clear in the guidelines, and I was genuinely confused by this when trying to pick sentences to add via the sentence collector.

Do you have ideas on how to express this in a way regular users can understand?

Thanks!

Is it allowed? If so, can someone give examples?

I wanted to update the lead post of this thread to clarify the meaning of ‘sentence’ in this context, but find I’m no longer able to edit it (no edit button appears at the bottom of the post). Are posts on Discourse locked after a certain time? If so, any ideas on where best to keep the most recent and updated version of the guidelines?

@leo any guidance here?

There’s a wiki on GitHub that might be a better place, but to be honest I’d prefer to have officially sanctioned guidelines linked prominently on the CV site. I doubt many users are seeing them here.

@nukeador Are the current guidelines sufficient for Mozilla to officially sanction?

We should probably coordinate with @mkohler to have them in the sentence collector.

I would recommend checking whether @mbranson or our copywriter (Jay) has any recommendations on where and how to present this in a way that’s useful for people (maybe a visual summary?)

@nukeador While this particular topic is about Sentence Collector, there are also guidelines for recording validation here, so I was really referring to both in my previous message.

They are, to prevent spam - but I’ve turned this and the other guidelines topic into wikis, so you should be able to edit them once again.

@Jean, I’ve edited the guidelines to make it clear that the uploaded texts don’t need to be “full sentences” (e.g. including a verb), provided that they make sense and are grammatically correct and self-contained. I’ve suggested that

“Any phrase that you could imagine being used as a caption to an image should be OK”

I’d prefer not to rely on grammatical definitions, as those wouldn’t be easily understood by many users.

A sentence is a set of words that is complete in itself, typically containing a subject and a predicate, conveying a statement, question, exclamation, or command…

Indeed. But we don’t need to bother volunteers with any of that complexity, especially as we don’t actually want to restrict to ‘sentences’ at all.

Could you give an example?

This is kind of a gray area; it is a full sentence.

Why does it matter if the sentence is meaningful? What exactly is the AI learning?

Here it’s important to note that humans will be reading these sentences, so if they are meaningless, we suspect people will disengage.

See also the reasons given by @Jean, here: Validating meaningless sentences in the Sentence Collector?

Some simple examples:

  • The giant dinosaurs of the Triassic.
  • Sheet lightning.
  • Fun with flags.
  • The lure of the wild.
  • Oh no, not again!
  • The end of the rainbow.
  • An example of running stitch.
  • Under the arches.

Sorry - can’t quite understand what your concern actually is, here. If you’re wanting to categorize the texts by formal grammatical type, as I said above that doesn’t help.

@Michael_Maggs

It’s not a concern, just curiosity about what you meant by captioning an image with a phrase; the way you put it is interesting.

@Michael_Maggs @Jean @nukeador

@Jean could you elaborate more on this point?

I agree with your side, but I’m more inclined to accept those kinds of sentences because:

  • I’m testing Machine Translation for Abkhazian, and so far it does spit out meaningless sentences. That is an abundant resource I can use in case I run out of sentences.

  • It might be disengaging for some, but on the other hand it might be engaging for others: “I’m driving my pizza with an elephant on my cheese.” is fun to read, and it might make someone smile.

  • It might allow the reader to be more conscious and aware when reading.

I don’t know about other languages but I do know that in English readers find it extremely difficult to read nonsense texts accurately. Readers stumble over the words, with very high rates of error.

I had a quick discussion with someone who pointed out that Common Voice can be used as a tool for learning the language.
In that case, nonsense texts shouldn’t be included, even though from the machine’s perspective it might not matter.
This also depends on usage: the scope for sentence validity can be narrowed or extended.