Discussion of new guidelines for uploaded sentence validation

Is it allowed? If so, can someone give examples?

I wanted to update the lead post of this thread to clarify the meaning of ‘sentence’ in this context, but find I’m no longer able to edit it (no edit button appears at the bottom of the post). Are posts on Discourse locked after a certain time? If so, any ideas on where best to keep the most recent and updated version of the guidelines?

@leo any guidance here?

There’s a wiki on GitHub that might be a better place, but to be honest I’d prefer to have officially sanctioned guidelines linked prominently on the CV site. I doubt many users are seeing them here.

@nukeador Are the current guidelines sufficient for Mozilla to officially sanction?

We should probably coordinate with @mkohler to have them in the sentence collector.

I would recommend seeing whether @mbranson or our copywriter (Jay) has any recommendations on where and how to present this in a way that’s useful for people (maybe a visual summary?)

@nukeador While this particular topic is about Sentence Collector, there are also guidelines for recording validation here, so I was really referring to both in my previous message.

They are, to prevent spam - but I’ve turned this and the other guidelines topic into wikis, so you should be able to edit them once again.

@Jean, I’ve edited the guidelines to make it clear that the uploaded texts don’t need to be “full sentences” (e.g. including a verb) provided that they make sense, and are grammatically correct and self-contained. I’ve suggested that:

“Any phrase that you could imagine being used as a caption to an image should be OK”

I’d prefer not to rely on grammatical definitions, as those wouldn’t be easily understood by many users.

A sentence is a set of words that is complete in itself, typically containing a subject and a predicate, conveying a statement, question, exclamation, or command…

Indeed. But we don’t need to bother volunteers with any of that complexity, especially as we don’t actually want to restrict to ‘sentences’ at all.

Could you give an example?

This is kind of a gray area; it is a full sentence.

Why does it matter whether the sentence is meaningful? What exactly is the AI learning?

Here it’s important to note that humans will be reading these sentences, so if they are meaningless, we suspect people will disengage.

See also the reasons given by @Jean, here: Validating meaningless sentences in the Sentence Collector?

Some simple examples:

  • The giant dinosaurs of the Triassic.
  • Sheet lightning.
  • Fun with flags.
  • The lure of the wild.
  • Oh no, not again!
  • The end of the rainbow.
  • An example of running stitch.
  • Under the arches.

Sorry - can’t quite understand what your concern actually is, here. If you’re wanting to categorize the texts by formal grammatical type, as I said above that doesn’t help.

@Michael_Maggs

It’s not a concern, rather curiosity about what you meant by captioning an image with a phrase. The way you put it is interesting.

@Michael_Maggs @Jean @nukeador

@Jean could you elaborate more on this point?

I agree with you, but here is what makes me more inclined to accept those kinds of sentences:

  • I’m testing Machine Translation for Abkhazian, and so far it does spit out meaningless sentences. That is an abundant resource: if I run out of sentences, I can use its output.
  • It might be disengaging for some but, on the other hand, engaging for others. “I’m driving my pizza with an elephant on my cheese.” is fun to read, and it might make someone smile.
  • It might allow the reader to be more conscious and aware when reading.

I don’t know about other languages but I do know that in English readers find it extremely difficult to read nonsense texts accurately. Readers stumble over the words, with very high rates of error.

I had a quick discussion with someone about this: Common Voice can be used as a tool for learning the language.
In that case, nonsense texts shouldn’t be included, even though from the machine’s perspective it might not matter.
This also depends on usage: the scope for sentence validity can be narrowed or extended.

Seems like a clever solution – nice!

I am not very knowledgeable on speech recognition, but for instance the CTC loss, which is commonly used for this task, seems to be often augmented with a language model. Now, the language models used are typically externally trained on different (larger) corpora, so I imagine it won’t matter too much if the speech corpus contains gibberish text.

However, there are end-to-end systems that don’t use an external language model. See for instance this 2017 paper by Battenberg et al. From the second paragraph of section 3:

attention and RNN-Transducers implicitly learn a language model from the speech training corpus

For this type of setting, I can imagine that having nonsensical language would negatively affect the implicitly learned language model, and would lead to lower performance on unseen data (because the model wouldn’t have been able to learn very well what makes a sentence plausible).
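
To make that intuition a bit more concrete, here is a toy sketch. It is not an attention or RNN-Transducer model: a simple bigram counter with add-one smoothing stands in for “a language model learned from the training transcripts”. The tiny corpus, the word-reversal trick used to produce nonsense, and the function names are all made up for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def perplexity(model, sentence):
    """Per-token perplexity of one sentence under add-one smoothing."""
    bigrams, contexts, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


# Hypothetical "meaningful" transcripts (illustrative only).
clean = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat lay on the rug",
]
# Nonsense built from the same words by reversing the word order.
nonsense = [" ".join(reversed(s.split())) for s in clean]

# A plausible sentence that the model has not seen.
held_out = "the dog lay on the mat"

ppl_clean = perplexity(train_bigram(clean), held_out)
ppl_mixed = perplexity(train_bigram(clean + nonsense), held_out)
print(f"clean only:       {ppl_clean:.2f}")
print(f"clean + nonsense: {ppl_mixed:.2f}")
```

In this toy setup the held-out sentence gets a higher perplexity once the reversed sentences are added, because they inflate the context counts without adding any plausible word pairs. A real end-to-end system is far more complex, so treat this only as an intuition, not as evidence about any particular model.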

Very interesting research.
Using Google Translate to translate “I’m driving my pizza with an elephant on my cheese.” to Russian, I get “Я веду пиццу со слоном на моем сыре.” which is correct.
MT is able to translate the text even if it’s not plausible; it doesn’t care whether it’s nonsensical.