Seems like a clever solution – nice!
I am not very knowledgeable about speech recognition, but take for instance models trained with the CTC loss, which is commonly used for this task: they are often combined with an external language model at decoding time. Since those language models are typically trained on separate (larger) text corpora, I imagine it won’t matter too much if the speech corpus contains gibberish text.
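To make the idea concrete, here is a toy sketch of the usual "shallow fusion" scoring rule, where a candidate transcript's acoustic score is combined with a score from an externally trained language model. All probabilities and the `lm_weight` value below are made-up numbers for illustration, not from any real system:

```python
import math

def fused_score(acoustic_logprob, lm_logprob, lm_weight=0.3):
    # Shallow-fusion scoring: log P_acoustic(y | x) + lambda * log P_LM(y).
    # lm_weight (lambda) is a tunable hyperparameter, 0.3 is arbitrary here.
    return acoustic_logprob + lm_weight * lm_logprob

# Two candidate transcripts the acoustic model finds roughly equally likely;
# the external LM strongly prefers the sensible sentence. Numbers are invented.
candidates = {
    "recognize speech":   {"acoustic": math.log(0.48), "lm": math.log(0.010)},
    "wreck a nice beach": {"acoustic": math.log(0.52), "lm": math.log(0.0001)},
}

best = max(
    candidates,
    key=lambda y: fused_score(candidates[y]["acoustic"], candidates[y]["lm"]),
)
print(best)  # the LM tips the decision toward the plausible sentence
```

The point is that the text knowledge comes from the LM's own training corpus, so gibberish in the speech transcripts would mostly affect the acoustic term, not the LM term.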
However, there are end-to-end systems that don’t use an external language model. See for instance this 2017 paper by Battenberg et al. From the second paragraph of section 3:
"attention and RNN-Transducers implicitly learn a language model from the speech training corpus"
For this type of system, I can imagine that nonsensical text in the training corpus would degrade the implicitly learned language model and lead to lower performance on unseen data, because the model would not have been able to learn what makes a sentence plausible.