Discussion of new guidelines for uploaded sentence validation

Michael_Maggs · July 8, 2019, 3:58pm

DRAFT GUIDELINES FOR REVIEWING UPLOADED TEXTS

A proposal to improve the Review Sentences paragraph currently found on this page: https://common-voice.github.io/sentence-collector/#/how-to

[Edited to include comments up to 8 July 2019]

Make sure the uploaded text meets the following criteria:

It must be spelled correctly (though slang spellings are allowed).
It must make sense, and be grammatically correct and self-contained.
It must be easily speakable, and in the correct language.

If the text meets the criteria, accept by clicking the “yes” button.
If the text does not meet the criteria, reject by clicking the “no” button.
If you are unsure, click the “skip” button to move on to the next one.

Examples:

Although the website is misleadingly called the “Sentence Collector”, don’t worry too much about whether the text meets the formal definition of a sentence. For example, it’s not necessarily a problem if the text does not include a verb. Any phrase that you could imagine being used as a caption to an image should be OK.

Reject texts with typos or accidental spelling or grammatical errors, but accept texts with slang terms and apparently intentional spelling variations. Before rejecting for spelling errors, remember that alternative spellings might be normal elsewhere in the world.

The giant dinosaurs of the Triassic.

The giant dinosaurs of the Triassic
[lack of full stop/period at the end is not considered an error]

The Giant Dinosaurs Of The Triassic.
[Accept unconventional capitalisation - it could be intentional in some contexts]

The giant dinesaurs of the Triassic.
[Spelling error]

“The giant dinosaurs of the Triassic.
[Punctuation error]

The giant dinosaurs of.
[Not grammatically correct and self-contained]

The giant dinosaurs if the Triassic.
[Obvious typo for ‘of’]

Is that to many potatoes for you?
[Obvious typo for ‘too’]

she said, after a pause.
[Not grammatically correct and self-contained. Appears to be only part of a sentence as it starts with a lower-case ‘s’.]

April is the cruellest month.
[Normal British English spelling]

April is the cruelest month.
[Normal US English spelling]

Are ya gonna hit ‘em?
[Slang and unconventional terms are OK]

It was the womans bag.
[Should formally be “woman’s”, but many people now intentionally omit the apostrophe, particularly in informal contexts. Accept, as we need to capture informal as well as formal usages]

The B-B-C is a British broadcaster.
[Misuse of “B-B-C” in order to avoid the prohibition on the usual abbreviation ‘BBC’]

Joyeux Noel.
[Not in the expected language. This French text has probably been uploaded to the wrong language section]

Deinococcus radiodurans is a species of bacterium.
[Not easily speakable; too obscure and difficult for many readers]

“nuqneH”, the Klingon Captain said.
[Not in the expected language, and not easily speakable]

I’m driving my pizza with an elephant on my cheese.
[Reject meaningless texts, for example those that appear to have been computer-generated]
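
Two of the rejection cases above (the unbalanced quotation mark and the spelled-out “B-B-C”) are mechanical enough that an uploader could screen for them before submitting. The following is only a rough sketch: the function names and the regular expression are assumptions made for illustration, not anything the Sentence Collector itself provides, and a flag should prompt a manual look rather than an automatic rejection.

```python
import re

# Hypothetical pre-upload checks; not part of the Sentence Collector.
# Matches capital letters joined by hyphens, such as "B-B-C" (assumed pattern).
SPELLED_OUT = re.compile(r"\b(?:[A-Z]-){2,}[A-Z]\b")

def has_unbalanced_double_quotes(text):
    """Return True if curly or straight double quotes do not pair up."""
    curly_mismatch = text.count("“") != text.count("”")
    straight_mismatch = text.count('"') % 2 != 0
    return curly_mismatch or straight_mismatch

def mechanical_flags(text):
    """Return a list of reasons why a text deserves a closer look."""
    flags = []
    if has_unbalanced_double_quotes(text):
        flags.append("unbalanced quotation marks")
    if SPELLED_OUT.search(text):
        flags.append("spelled-out abbreviation")
    return flags
```

Spelling, grammar, sense and speakability still need a human reviewer; this only catches the two patterns named.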

Michael_Maggs · April 2, 2019, 4:11pm

This thread is for community discussion of sentence reviewing guidelines in the Sentence Collector. For recording review, see Discussion of new guidelines for recording validation

sixten · April 2, 2019, 5:06pm

Here’s a question:
You recommend opensubtitles.org as a great source.
But if I look for movie subtitles aren’t the scripts copyrighted, even if it’s transcribed by some dude in a basement?
And what about translated subtitles, do the rights belong to the translator or the owner of rights for the original script?

Michael_Maggs · April 2, 2019, 7:42pm

I’m just a volunteer here, so can’t reply officially. But I do think that the advice given is wrong. Subtitles may well have copyright protection, especially if many examples are taken from a database. And translated subtitles may have copyright in the translated version as well as in the original.

dabinat · April 6, 2019, 1:59am

There are a handful of “open source” movies like Sintel, Elephant’s Dream, Big Buck Bunny, etc. I’m not sure how many are CC0 though.

The current cutoff for US copyright expiration (i.e. when creative works become public domain) is 1923, while the first “talkie” was released in 1927. So it will be quite a while before we could get large amounts of data from movie subtitles under CC0.

Bloubi · April 9, 2019, 8:04am

Hi all,

At least in the French corpus, I can see people uploading fragments of sentences taken from books. I guess this comes from a restriction on length, but the result is that these fragments sound somewhat odd, as they can be the middle of a full sentence. There is no mistake per se, but the meaning looks odd. For example: “asked John to the audience” instead of “What do you want me to do? asked John to the audience” (that’s just an illustration, not a real sample).

Michael_Maggs · April 9, 2019, 2:57pm

Hi @Bloubi – Yes, this is happening in English as well. As suggested in the guidelines above, such ‘sentences’ should in my opinion be rejected as they make no sense, are non-grammatical, and are often difficult for volunteers to read.

It’s been suggested several times that the Sentence Collector should reject uploads that start with a lower-case letter, but that’s not been implemented yet. In the meantime, volunteers running scripts to extract text from books need to add their own check for such problems. It often happens with sentences such as

What do you want me to do? asked John to the audience.
or
“That’s brilliant!”, she said.

when the tokeniser wrongly thinks that an exclamation or question mark can only come at the end of a sentence.
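
As a rough illustration of the kind of check an extraction script could add, the sketch below re-attaches pieces that a naive splitter has cut off after a question or exclamation mark, and lists whatever still starts in lower case for manual review. All names here are hypothetical, and it assumes the script already has its output as a list of strings. A lower-case start is only a hint, not proof of a fragment, so the second function reports candidates and does not delete anything.

```python
# Hypothetical helpers for a text-extraction script; not part of the Sentence Collector.

# Characters that should attach to the previous piece without a space.
CLOSING_MARKS = "”’\"',"

def starts_with_lowercase(text):
    """Return True if the first letter in the text is lower case."""
    for ch in text:
        if ch.isalpha():
            return ch.islower()
    return False  # no letters at all, so nothing to judge

def merge_fragments(pieces):
    """Re-attach pieces that a naive splitter cut off after '?' or '!'."""
    merged = []
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        if merged and starts_with_lowercase(piece):
            # Closing quotes and commas attach directly; words need a space.
            joiner = "" if piece[0] in CLOSING_MARKS else " "
            merged[-1] = merged[-1] + joiner + piece
        else:
            merged.append(piece)
    return merged

def leftover_fragments(sentences):
    """List sentences that still start in lower case, for manual review."""
    return [s for s in sentences if starts_with_lowercase(s)]
```

Running `merge_fragments` first and `leftover_fragments` afterwards keeps legitimate lower-case starts visible to a human instead of silently dropping them.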

Bloubi · April 9, 2019, 2:59pm

Hi @Michael_Maggs,

thanks. Meanwhile, I’ll reject them manually if they cross my path.

Cedric

nukeador · April 9, 2019, 3:02pm

About this: a sentence starting with a lower-case letter isn’t wrong by default, as I commented in this issue. We should improve the detection of split sentences, but not by adding a validation that could affect a lot of valid sentences.

Michael_Maggs · April 9, 2019, 3:05pm

Let’s continue to discuss the specific point at https://github.com/Common-Voice/sentence-collector/issues/202

moonhouse · May 28, 2019, 8:20pm

It is for example quite noticeable that a substantial part of the Swedish sentences come from the opensubtitles.org subtitles for the Netflix movie Budapest.

A sentence such as

Vi organiserar svensexor i Budapest för franska brudgummar.

(We organize stag parties in Budapest for French grooms.)

is quite unique and the string “Budapest” can be found 15 times in sentence-collector.json

The recommendation of opensubtitles.org should at least come with a caveat that the user should make sure that the subtitle isn’t a derivative work of a copyrighted work (not in public domain or CC0).

(It should also be noted that homemade subtitles of this type often suffer from low quality. I found missing spaces, missing letters and misspellings among the Swedish sentences. I would guess there were many more before the review process, but with low-quality input you are almost bound to have a few slip through.)
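
A count like the one for “Budapest” above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the sentences have already been loaded into a list of strings; the layout of sentence-collector.json is not described in this thread, so the loading step is left out, and the function name is made up for illustration.

```python
# Hypothetical helper; assumes the sentences are already in a list of strings.

def count_occurrences(sentences, term):
    """Count case-insensitive occurrences of term across all sentences."""
    needle = term.lower()
    return sum(sentence.lower().count(needle) for sentence in sentences)

# Made-up sample list; only the first entry is quoted from the post above.
sample = [
    "Vi organiserar svensexor i Budapest för franska brudgummar.",
    "Hej.",
]
print(count_occurrences(sample, "Budapest"))
```

An unusually high count for a distinctive proper noun is only a hint that many sentences share one source; it says nothing by itself about the licence of that source.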

Jean · June 21, 2019, 2:27pm

I see the word sentences repeated multiple times, but a lot of (good) content I see on Common Voice does not qualify as full sentences. For instance, the first green example in the OP, “The giant dinosaurs of the Triassic”, is a noun phrase rather than a sentence, as it lacks a verb. I assume then, given the examples, that both noun phrases and sentences are OK to add. How about other constructs, like adjective phrases and verb phrases?

This isn’t just to be pedantic about linguistics – I think it’s important to be clear in the guidelines, and I was genuinely confused by this when trying to pick sentences to add via the sentence collector.

nukeador · June 21, 2019, 9:57pm

Do you have ideas on how to express this in a way regular users can understand?

Thanks!

daniel.abzakh · July 4, 2019, 7:12am

Is it allowed? If so, can someone give examples?

Michael_Maggs · July 6, 2019, 11:44am

I wanted to update the lead post of this thread to clarify the meaning of ‘sentence’ in this context, but find I’m no longer able to edit it (no edit button appears at the bottom of the post). Are posts on Discourse locked after a certain time? If so, any ideas on where best to keep the most recent and updated version of the guidelines?

nukeador · July 6, 2019, 7:55pm

@leo any guidance here?

dabinat · July 7, 2019, 6:39pm

There’s a wiki on Github that might be a better place, but to be honest I’d prefer to have officially-sanctioned guidelines linked prominently on the CV site. I doubt many users are seeing them here.

@nukeador Are the current guidelines sufficient for Mozilla to officially sanction?

nukeador · July 7, 2019, 7:35pm

We should probably coordinate with @mkohler to have them in the sentence collector.

I would recommend checking whether @mbranson or our copywriter (Jay) has any recommendations on where and how to present this in a way that’s useful for people (maybe a visual summary?)

dabinat · July 7, 2019, 9:14pm

@nukeador While this particular topic is about Sentence Collector, there are also guidelines for recording validation here, so I was really referring to both in my previous message.

leo · July 8, 2019, 9:50am

They are, to prevent spam - but I’ve turned this and the other guidelines topic into wikis, so you should be able to edit them once again.