Sentence Approval Upvotes

mkohler · April 23, 2021, 5:21pm

Wikipedia articles are not Public Domain, they are shared under the Creative Commons Attribution-ShareAlike License, which makes it unsuitable for inclusion in Common Voice by default. However we are allowed to include a maximum of 3 sentences per article as per a legal agreement. This process needs to be run by the Common Voice project to guarantee that no more than allowed are extracted.

This is the reason why we have the mentioned Sentence Extractor. Here’s how the process looks like:

Rules get developed and verified by the Community
Once the error rate is low enough the rules get merged
This triggers an automatic sentence extraction which guarantees that only 3 sentences per article are used
The resulting output is added to Common Voice without going through the Sentence Collector

This has a few advantages that will also benefit you, including not needing to review every single sentence but just a sample. Happy to answer any questions you might have either here or in the #common-voice-sentence-extractor:mozilla.org room on Matrix.

As we can’t guarantee that legal requirement in Sentence Collector, I had to remove these sentences.