Discussion of new guidelines for uploaded sentence validation

(Michael Maggs) #1


A proposal to improve the Review Sentences paragraph currently found on this page: https://common-voice.github.io/sentence-collector/#/how-to

[Edited to include comments up to 2 April 2019]

Make sure the sentence meets the following criteria:

  • It must be spelled correctly (though slang spellings are allowed).
  • It must make sense and be grammatically correct.
  • It must be easily speakable, and in the correct language.

If the sentence meets the criteria, accept by clicking the “yes” button.
If the sentence does not meet the criteria, reject by clicking the “no” button.
If you are unsure, click the “skip” button to move on to the next one.


Reject texts with typos or accidental spelling or grammatical errors, but accept texts with slang terms and apparently intentional spelling variations. Before rejecting for spelling errors, remember that alternative spellings might be normal elsewhere in the world.

:white_check_mark: The giant dinosaurs of the Triassic.

:white_check_mark: The giant dinosaurs of the Triassic
[Lack of full stop/period at the end is not considered an error]

:white_check_mark: The Giant Dinosaurs Of The Triassic.
[Accept unconventional capitalisation - it could be intentional in some contexts]

:x: The giant dinesaurs of the Triassic.
[Spelling error]

:x: “The giant dinosaurs of the Triassic.
[Punctuation error]

:x: The giant dinosaurs of.
[Not grammatical]

:x: The giant dinosaurs if the Triassic.
[Obvious typo for ‘of’]

:x: Is that to many potatoes for you?
[Obvious typo for ‘too’]

:x: she said, after a pause.
[Not grammatical. Appears to be only part of a sentence as it starts with a lower case ‘s’.]

:white_check_mark: April is the cruellest month.
[Normal British English spelling]

:white_check_mark: April is the cruelest month.
[Normal US English spelling]

:white_check_mark: Are ya gonna hit ‘em?
[Slang and unconventional terms are OK]

:white_check_mark: It was the womans bag.
[Should formally be “woman’s”, but many people now intentionally omit the apostrophe, particularly in informal contexts. Accept, as we need to capture informal as well as formal usages]

:x: The B-B-C is a British broadcaster.
[Misuse of “B-B-C” in order to avoid the prohibition on the usual abbreviation ‘BBC’]

:x: Joyeux Noel.
[Not in the expected language. This French sentence has probably been uploaded to the wrong language section]

:x: “nuqneH”, the Klingon Captain said.
[Not in the expected language, and not easily speakable]

:x: I’m driving my pizza with an elephant on my cheese.
[Reject meaningless sentences, for example those that appear to have been computer-generated]
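
A few of the rejections above are mechanical enough that an uploader could flag them before review. The following is only a rough sketch, not part of the Sentence Collector: the function name `prescreen` and the three checks are assumptions made for illustration, and they cover only unpaired double quotation marks, a lower-case first letter, and spelled-out abbreviations such as “B-B-C”. Spelling, grammar, language, and meaning still need a human reviewer, and a flagged sentence is not necessarily wrong.

```python
import re

def prescreen(sentence):
    """Return a list of mechanical warnings; an empty list means nothing was flagged.

    Hypothetical helper for illustration only, not an official validator.
    """
    warnings = []

    # Unpaired double quotation marks (curly or straight).
    if sentence.count("“") != sentence.count("”") or sentence.count('"') % 2:
        warnings.append("unbalanced quotation marks")

    # First alphabetic character is lower case: possibly a sentence fragment.
    first_letter = next((ch for ch in sentence if ch.isalpha()), "")
    if first_letter.islower():
        warnings.append("starts with a lower-case letter")

    # Three or more capital letters joined by hyphens, e.g. "B-B-C".
    if re.search(r"\b(?:[A-Z]-){2,}[A-Z]\b", sentence):
        warnings.append("spelled-out abbreviation")

    return warnings

for example in [
    "The giant dinosaurs of the Triassic.",
    "“The giant dinosaurs of the Triassic.",
    "she said, after a pause.",
    "The B-B-C is a British broadcaster.",
]:
    print(example, "->", prescreen(example))
```

Single quotation marks are deliberately ignored, since apostrophes in sentences such as “Are ya gonna hit ‘em?” would otherwise be flagged.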

(Michael Maggs) #2

This thread is for community discussion of sentence reviewing guidelines in the Sentence Collector. For recording review, see the thread “Discussion of new guidelines for recording validation”.


Here’s a question:
You recommend opensubtitles.org as a great source.
But if I look for movie subtitles aren’t the scripts copyrighted, even if it’s transcribed by some dude in a basement?
And what about translated subtitles, do the rights belong to the translator or the owner of rights for the original script?

(Michael Maggs) #4

I’m just a volunteer here, so can’t reply officially. But I do think that the advice given is wrong. Subtitles may well have copyright protection, especially if many examples are taken from a database. And translated subtitles may have copyright in the translated version as well as in the original.


There are a handful of “open source” movies like Sintel, Elephant’s Dream, Big Buck Bunny, etc. I’m not sure how many are CC0 though.

The current cutoff for US copyright expiration (i.e. when creative works become public domain) is 1923, while the first “talkie” was released in 1927. So it will be quite a while before we could get large amounts of data from movie subtitles under CC0.


Hi all,

at least in the French corpus, I can see people uploading segmented pieces of sentences taken from books. I guess this comes from a restriction on length, but the result is that these segments sound somewhat weird, as they can be the middle of a full sentence. There is no mistake per se, but the meaning looks odd. For example: “asked John to the audience”, instead of “What do you want me to do? asked John to the audience” (that’s just an illustration, not a real sample).

(Michael Maggs) #7

Hi @Bloubi – Yes, this is happening in English as well. As suggested in the guidelines above, such ‘sentences’ should in my opinion be rejected as they make no sense, are non-grammatical, and are often difficult for volunteers to read.

It’s been suggested several times that the Sentence Collector should reject uploads that start with a lower-case letter, but that’s not been implemented yet. In the meantime, volunteers running scripts to extract text from books need to add their own check for such problems. It often happens with sentences such as

What do you want me to do? asked John to the audience.
or
“That’s brilliant!”, she said.

when the tokeniser wrongly thinks that an exclamation or question mark can only come at the end of a sentence.
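
As a rough illustration of the kind of check an extraction script could add, here is a minimal sketch that splits naively after “.”, “!” or “?” and then glues a piece back onto the previous one when it starts with a lower-case letter. The function name `split_sentences` is hypothetical, the splitting rule is deliberately simplistic (it will wrongly split after abbreviations such as “Mr.”), and the lower-case heuristic assumes English-style capitalisation, so it can also merge sentences that legitimately start in lower case.

```python
import re

def split_sentences(text):
    """Naive sentence splitter that re-joins fragments starting in lower case.

    Hypothetical helper for illustration; not the Sentence Collector's own logic.
    """
    # Break on whitespace that follows ., ! or ?, optionally with a closing quote.
    pieces = re.split(r'(?<=[.!?]["”’])\s+|(?<=[.!?])\s+', text.strip())

    merged = []
    for piece in pieces:
        if not piece:
            continue
        if merged and piece[:1].islower():
            # "asked John ..." or "she said." belongs to the previous piece.
            merged[-1] += " " + piece
        else:
            merged.append(piece)
    return merged

sample = "What do you want me to do? asked John to the audience. It rained."
for sentence in split_sentences(sample):
    print(sentence)
```

Merging rather than discarding keeps the full sentence available, so the uploader can still decide whether it fits within the length limit.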


Hi @Michael_Maggs,

thanks. Meanwhile, I’ll reject them manually if they cross my path.


(Rubén Martín) #9

About this: a sentence starting with a lower-case letter isn’t wrong by default, as I commented in this issue. We should improve split-sentence detection rather than add a validation that could affect a lot of valid sentences.

(Michael Maggs) #10

Let’s continue to discuss the specific point at https://github.com/Common-Voice/sentence-collector/issues/202