Discussion of new guidelines for uploaded sentence validation

@nukeador While this particular topic is about Sentence Collector, there are also guidelines for recording validation here, so I was really referring to both in my previous message.

They are, to prevent spam - but I’ve turned this and the other guidelines topic into wikis, so you should be able to edit them once again.

@Jean, I’ve edited the guidelines to make it clear that the uploaded texts don’t need to be “full sentences” (e.g. including a verb), provided that they make sense and are grammatically correct and self-contained. I’ve suggested that

“Any phrase that you could imagine being used as a caption to an image should be OK”

I’d prefer not to rely on grammatical definitions, as those wouldn’t be easily understood by many users.

A sentence is a set of words that is complete in itself, typically containing a subject and a predicate, conveying a statement, question, exclamation, or command…

Indeed. But we don’t need to bother volunteers with any of that complexity, especially as we don’t actually want to restrict to ‘sentences’ at all.

Could you give an example?

This is kind of in a gray area, it is a full sentence.

Why does it matter if the sentence is meaningful? What exactly is the AI learning?

Here it’s important to note that humans will be reading these sentences, so if they are meaningless, we suspect people will disengage.

See also the reasons given by @Jean, here: Validating meaningless sentences in the Sentence Collector?

Some simple examples:

  • The giant dinosaurs of the Triassic.
  • Sheet lightning.
  • Fun with flags.
  • The lure of the wild.
  • Oh no, not again!
  • The end of the rainbow.
  • An example of running stitch.
  • Under the arches.

Sorry - can’t quite understand what your concern actually is, here. If you’re wanting to categorize the texts by formal grammatical type, as I said above that doesn’t help.

@Michael_Maggs

It’s not a concern, more a curiosity: I wanted to understand what you meant by captioning an image with a phrase. The way you put it is interesting.

@Michael_Maggs @Jean @nukeador

@Jean could you elaborate more on this point?

I agree with your side. What makes me more inclined to accept those kinds of sentences:

  • I’m testing machine translation for Abkhazian. So far it does spit out meaningless sentences, which are an abundant resource: if I run out of sentences, I can use them.
  • It might be disengaging for some, but on the other hand it might be engaging for others: “I’m driving my pizza with an elephant on my cheese.” is fun to read, and it might make someone smile.
  • It might make the reader more conscious and aware when reading.

I don’t know about other languages but I do know that in English readers find it extremely difficult to read nonsense texts accurately. Readers stumble over the words, with very high rates of error.

I had a quick discussion with someone who pointed out that Common Voice can be used as a tool for learning the language.
In that case, nonsense texts shouldn’t be included, even though it might not matter from the machine’s perspective.
This also depends on usage: the scope of sentence validity can be narrowed or extended.

Seems like a clever solution – nice!

I am not very knowledgeable on speech recognition, but for instance the CTC loss, which is commonly used for this task, seems to be often augmented with a language model. Now, the language models used are typically externally trained on different (larger) corpora, so I imagine it won’t matter too much if the speech corpus contains gibberish text.

However, there are end-to-end systems that don’t use an external language model. See for instance this 2017 paper by Battenberg et al. From the second paragraph of section 3:

attention and RNN-Transducers implicitly learn a language model from the speech training corpus

For this type of setting, I can imagine that having nonsensical language would negatively affect the implicitly learned language model, and would lead to lower performance on unseen data (because the model wouldn’t have been able to learn very well what makes a sentence plausible).
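
To make the external language model case more concrete, here is a minimal sketch of rescoring a few candidate transcriptions with a separately trained language model. It is not CTC itself, and it is not taken from any particular toolkit: the function names, the toy bigram model, the tiny "external" corpus, and the acoustic scores are all made up for illustration. The point is only that the language model's notion of plausibility comes from its own corpus, not from the speech corpus.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigram contexts and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])                  # context counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (context, next word)
    return unigrams, bigrams, len(vocab)


def lm_log_prob(sentence, lm):
    """Add-one smoothed bigram log-probability of a sentence."""
    unigrams, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens[:-1], tokens[1:])
    )


def rescore(hypotheses, lm, lm_weight=0.5):
    """Return the text maximizing acoustic_log_prob + lm_weight * lm_log_prob."""
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0], lm))
    return best[0]


# Hypothetical external text corpus (independent of the speech corpus).
external_corpus = [
    "i am driving my car",
    "i am eating my pizza",
    "she is driving my car",
]
lm = train_bigram_lm(external_corpus)

# Hypothetical (text, acoustic log-probability) pairs; the values are invented.
hypotheses = [
    ("i am driving my car", -4.2),
    ("i am driving my pizza", -4.0),
]

print(rescore(hypotheses, lm))                 # language model tips the balance
print(rescore(hypotheses, lm, lm_weight=0.0))  # acoustic score alone decides
```

With the language model switched off (`lm_weight=0.0`) the acoustically preferred hypothesis wins. With it switched on, the hypothesis the external corpus finds more plausible wins, regardless of what text the speech recordings were read from.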

Very interesting research.
Using Google Translate to translate “I’m driving my pizza with an elephant on my cheese.” to Russian, I get “Я веду пиццу со слоном на моем сыре.”, which is correct.
MT is able to translate the text even when it’s not plausible; it doesn’t care whether it’s nonsensical.

I’ll give my two cents on this topic. I think this might help the model overall, since the dataset needs to be as diverse as possible, and @daniel.abzakh is right: you could say something out of the ordinary. As an example, most systems don’t recognize “um” or “ahh”, which are part of the sentence nonetheless. Perhaps even intonation could be part of the dataset, since questions don’t seem to be understood correctly: if I say “How are you?”, I want to see it as I intended, not “how are you”. To finish, I think this project is also awesome for language learning; at some point it could be aimed at people in this niche.
Thanks.

Hi.
I have two questions.

First, is it OK to have sentences in which words are written with lengthened pronunciation?
It’s common in casual conversation. For example, in Japanese:

  • お願いします。 (onegai-shimaasu)
  • ここだよ。 (koko-dayoo)

Second, sometimes I see “conversation” style sentences in collector. For example,

  • “Where are you going?” “The park”
  • “I want to be a plane” “Not a bird?”

I think such a sentence is inappropriate for a “single” speaker. Or does it not matter?

Regarding your second point: as long as both the sentences aren’t taken as one i.e. “Where are you going?” is a single sentence, and “The park” is a different sentence, that is completely fine, and actually preferred to “single speaker” style sentences. (As that is the most common type of voice data algorithms may find in the wild)
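
As a rough illustration of treating each part as its own sentence, the sketch below splits a conversation-style entry into its quoted utterances. This is only an assumption about how someone might pre-process such entries before uploading them; `split_dialogue` is a hypothetical helper, not something the Sentence Collector is known to provide.

```python
import re

# Text enclosed in curly or straight double quotes.
QUOTED = re.compile(r'[“"]([^“”"]+)[”"]')


def split_dialogue(entry):
    """Return each quoted utterance as a separate sentence.

    Entries with fewer than two quoted parts are returned unchanged,
    so ordinary sentences are not broken up.
    """
    parts = [part.strip() for part in QUOTED.findall(entry)]
    return parts if len(parts) > 1 else [entry.strip()]


print(split_dialogue("“Where are you going?” “The park”"))
print(split_dialogue("Fun with flags."))
```

Each returned item can then be reviewed on its own, so that no single recording has to cover both sides of the exchange.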

Ah, thanks @Adrijaned!

So it's no problem for the speech algorithm. Does that mean the sentence can be used for Common Voice, and I can give it a thumbs up in the review?
I rejected some of them... I should have asked the question sooner.

In the above example, that means it doesn't matter if it's two sentences.

  • I want to take a day off tomorrow. But I have to go to school.
  • For some reason, I don't feel the urge to rush. We have plenty of time.

There are many sentences like that in the Japanese-language Collector and its source texts.

Intentional accent of the sentence

Japanese version: On the "intentional accent" of sentences

This kind of contraction or slang is possible. So what about writing "accents" as they are?
Just as each person has a different way of speaking, if we are going to write speech down, each way of speaking should have its own way of being written.

For example, in Japan, there are people who say "あっこ" (akko) instead of "あそこ" (asoko). I heard it in a gameplay video. The person who posted it is from the Kansai region of Japan. It was the first time I had heard it, but I knew right away that he was talking about "あそこ". However, if I had seen it in a written sentence, I would have thought it was a "mistake".
Japanese has what is called a "standard" form. I'm from Kanto, so I don't think I've been particularly influenced by dialect. But that's also something I can't know unless other people "hear" it.
I may be writing biased Japanese. That makes me vaguely uneasy (because I've already added hundreds of them to the Collector).

I saw the discussion about the English language. I thought the English language has its own challenges because of its widespread, pervasive nature.
What is a "familiar" way of speaking (or writing) for some people is the "wrong" way of speaking for others.
This may be inevitable. That "difference" is the joy of language.

It's inevitable that people will write with an accent without knowing it's there. So is it OK to intentionally write accents?
And should the "accent" of a sentence be rejected?

Uh, do unconventional terms include accents?
If we were to accept this, wouldn't we be making people who aren't familiar with it speak with an accent? (Of course it is possible to keep a record of this.)
Yes, if we're going to collect people's "speech patterns" over a wide area, a sentence with an accent is appropriate enough, and is just as important an "asset" as a "voice". Normally, people don't try to write down their accents, you know. After all, Common Voice only lets us speak what is written!